.. _pipeline_module: Full pipeline (ProQSAR) ======================= The ``proqsar.qsar.ProQSAR`` class provides a single-call, opinionated end-to-end QSAR workflow that chains the core modules (standardization, featurization, preprocessing, splitting, feature selection, model development, hyperparameter optimisation and evaluation) into a reproducible experiment. This example demonstrates a reproducible run: training + optimisation followed by inference with prediction intervals and applicability-domain flags. Training & optimisation (reproducible) -------------------------------------- .. code-block:: python import pandas as pd from proqsar.qsar import ProQSAR from proqsar.Config.config import Config # small reproducible demo dataset (public) url = "https://raw.githubusercontent.com/Medicine-Artificial-Intelligence/ProQSAR/main/Data/testcase.csv" data = pd.read_csv(url).iloc[:50, :] # use small sample for quick demo data['id'] = data.index # centralised config: change splitter / optimizer settings here cfgs = Config( splitter={'option': 'scaffold'}, optimizer={'n_trials': 50} ) # create pipeline — set random_state for reproducibility pipeline = ProQSAR( activity_col='pChEMBL', id_col='id', smiles_col='Smiles', n_jobs=4, project_name='Demo', scoring_target='r2', n_splits=5, n_repeats=5, config=cfgs, random_state=42 ) # run the full training + optimisation pipeline (alpha used for internal statistical tests) result = pipeline.run_all(pd.DataFrame(data), alpha=0.05) Notes: - For full reproducibility ensure: fixed ``random_state``; deterministic platform (same Python/RDKit/NumPy versions); and a fixed Optuna seed when running long studies (set in Config/optimizer if supported). - Use a small ``n_trials`` and CV folds for quick debugging; increase for production. Pipeline summary (object repr) ------------------------------ After successful training the pipeline prints a concise summary. Example: .. code-block:: text ┌────────────────────────────────────────────────────────────────────┐ │ ProQSAR Pipeline │ ├────────────────────────────────────────────────────────────────────┤ │ Project: Demo │ │ Save Dir: Project/Demo │ │ Selected feature: 'RDK5' │ │ Fitted: True │ │ Models registered: 1 │ │ Selected model: 'XGBRegressor' │ │ CV (XGBRegressor): 0.683 ± 0.298 │ │ n_jobs: 4 │ │ scoring_target: 'r2' │ │ Optimizer: enabled ConfPred: enabled AD: enabled │ └────────────────────────────────────────────────────────────────────┘ Key fields: - ``Selected feature`` — name of the selected feature set (e.g. RDK5). - ``Selected model`` — model family chosen after benchmarking. - ``CV (...)`` — cross-validated score ± std (scoring_target). - ``Optimizer``, ``ConfPred``, ``AD`` — flags for hyperparameter search, conformal prediction, and applicability-domain checks. Inference (predict using :class:`Inference`) ------------------------------------------- Use the :class:`proqsar.infer.Inference` wrapper to run predictions with the same output format returned by :meth:`pipeline.predict` (point predictions, conformal prediction intervals and applicability-domain flags). The wrapper additionally stores metadata about the last run and prints a compact summary when the :class:`Inference` object is printed. .. code-block:: python import pandas as pd from proqsar.infer import Inference # load test / new set url = "https://raw.githubusercontent.com/Medicine-Artificial-Intelligence/ProQSAR/main/Data/testcase.csv" test = pd.read_csv(url) # optional: create id column if you prefer an explicit id in output test["id"] = test.index # wrap the trained pipeline (inplace controls whether input DF may be modified) infer = Inference(pipeline, inplace=True) # run inference (id_key=None => index will be used; ground_truth optional) preds_df = infer.run( test, smiles_key="Smiles", id_key=None, # None means infer will pass-through the index as `id` ground_truth="pChEMBL", # provide if you want observed activity in output alpha=0.05 # 95% prediction intervals ) # preds_df is the same-format table produced by pipeline.predict(...) print(preds_df.head()) Example output (first rows) --------------------------- .. code-block:: text id pChEMBL Predicted value Prediction Interval (alpha=0.05) Applicability domain 0 7.698970 6.720965 [4.429, 8.584] in 1 6.576754 7.760520 [5.270, 9.386] in 2 5.970000 6.018114 [4.426, 8.542] in 3 5.602060 5.681627 [3.624, 7.740] in 4 5.397940 5.718508 [3.785, 7.902] in ... 49 5.761954 5.681627 [3.624, 7.740] in Column explanations ------------------- - ``id`` — input sample identifier (passed-through). When ``id_key=None``, the DataFrame index is used and passed through as ``id``. - ``pChEMBL`` — observed activity if present (passed-through / used for evaluation). - ``Predicted value`` — model point prediction (mean/median depending on estimator/wrapping). - ``Prediction Interval (alpha=0.05)`` — conformal prediction interval for the chosen ``alpha``. - ``Applicability domain`` — in/out flag indicating whether the sample lies within the model's AD. Printing the Inference object (compact summary) ----------------------------------------------- After running :meth:`Inference.run`, printing the :class:`Inference` object displays a compact summary for the last run (row count, AD split, prediction statistics and quantiles). Example: .. code-block:: text ┌────────────────────────────────────────────────────────────────────────┐ │ Inference (ProQSAR) │ ├────────────────────────────────────────────────────────────────────────┤ │ Project: Demo │ │ Save Dir: Project/Demo │ │ Selected feature: 'RDK5' │ │ Last run (rows): 50 │ │ Applicability domain column: Applicability domain │ │ AD: in=50 (100.00%) out=0 │ │ Predictions — mean: 5.978 std: 1.208 nan%: 0.00% │ │ Quantiles (10/50/90): 4.619 / 5.730 / 7.778 │ └────────────────────────────────────────────────────────────────────────┘ Notes ----- - ``inplace``: when ``True`` the input DataFrame may be modified in-place; set ``inplace=False`` to preserve the original. - ``alpha``: conformal prediction level (e.g. ``0.05`` → 95% PI). - If you encounter a ``KeyError`` for ``smiles_key`` or ``id_key``, verify the input DataFrame column names and pass the correct keys to :meth:`Inference.run`. Reproducibility checklist ------------------------- - Fix ``random_state`` in the pipeline and any downstream components that accept a seed. - Pin environment versions (Python, RDKit, scikit-learn, xgboost/optuna) - Save artifacts from runs (``pipeline.save_dir``) — the folder includes model, CV results, Optuna study and plots. - When comparing runs, keep ``alpha`` / CV settings / optimizer budget identical. Troubleshooting --------------- - If predictions are unexpectedly constant or many samples are marked ``out`` in AD, inspect: - preprocessing logs (duplicates / missing / low-variance steps), - feature generation (are features identical?), - applicability-domain thresholds / distance metric settings. - If Optuna optimisation produces noisy outcomes, increase ``n_trials`` and/or use a deterministic sampler/seed.