API

Data

Standardizer

class proqsar.Data.Standardizer.SMILESStandardizer(smiles_col: str = 'SMILES', normalize: bool = True, tautomerize: bool = True, remove_salts: bool = False, handle_charges: bool = False, uncharge: bool = False, handle_stereo: bool = True, remove_fragments: bool = False, largest_fragment_only: bool = False, n_jobs: int = 1, deactivate: bool = False)

Bases: BaseEstimator

Class for comprehensive standardization of chemical structures represented in SMILES format.

This class provides a configurable pipeline of standardization steps built on RDKit. Depending on the options enabled, it normalizes molecules, canonicalizes tautomers, removes salts, adjusts charges, assigns stereochemistry, filters fragments, and handles hydrogens consistently. Standardized molecules are returned as both RDKit Mol objects and SMILES strings, suitable for downstream cheminformatics workflows.

Methods

smiles2mol(smiles: str) -> Optional[Chem.Mol]

Convert a SMILES string to an RDKit Mol object.

standardize_mol(mol: Chem.Mol) -> Optional[Chem.Mol]

Apply all configured standardization steps to an RDKit Mol object.

standardize_smiles(smiles: str) -> Tuple[Optional[str], Optional[Chem.Mol]]

Convert and standardize a SMILES string, returning both the canonical SMILES and RDKit Mol object.

standardize_dict_smiles(data_input: Union[pd.DataFrame, List[dict]])

Apply standardization to all SMILES in a DataFrame or list of dicts.

Parameters:
  • smiles_col (str, optional) – Column/key name containing SMILES strings in the input data.

  • normalize (bool, optional) – If True, normalize molecules (aromaticity, functional groups, etc.).

  • tautomerize (bool, optional) – If True, canonicalize tautomers into a single representation.

  • remove_salts (bool, optional) – If True, strip counter-ions and salt fragments.

  • handle_charges (bool, optional) – If True, reionize charges to standard protonation states.

  • uncharge (bool, optional) – If True, remove charges by neutralizing charged species.

  • handle_stereo (bool, optional) – If True, assign or clean stereochemistry information.

  • remove_fragments (bool, optional) – If True, discard extra fragments (e.g., keep only parent molecule).

  • largest_fragment_only (bool, optional) – If True, retain only the largest connected fragment.

  • n_jobs (int, optional) – Number of parallel jobs to use when standardizing a batch of molecules.

  • deactivate (bool, optional) – If True, disable all standardization steps (useful for debugging).

static smiles2mol(smiles: str) Mol | None

Convert a SMILES string to an RDKit Mol object.

Parameters:

smiles (str) – SMILES string to be converted.

Returns:

RDKit Mol object or None if conversion fails.

Return type:

Optional[Chem.Mol]

standardize_dict_smiles(data_input: DataFrame | List[dict]) DataFrame | List[dict]

Standardize SMILES strings within a pandas DataFrame or a list of dictionaries using parallel processing.

Parameters:

data_input (Union[pandas.DataFrame, List[dict]]) – Data containing SMILES strings to be standardized. Can be a pandas DataFrame or list of dicts.

Returns:

Input data with additional standardized SMILES and Mol columns/keys.

Return type:

Union[pandas.DataFrame, List[dict]]

Raises:
  • TypeError – If input is not a DataFrame or list of dictionaries.

  • Exception – Any unexpected exception encountered during processing.

standardize_mol(mol: Mol) Mol | None

Standardize an RDKit Mol object using various chemical standardization steps.

Parameters:

mol (Chem.Mol) – The molecule to be standardized.

Returns:

The standardized molecule or None if it cannot be processed.

Return type:

Optional[Chem.Mol]

Raises:

ValueError – If the input molecule is None.

standardize_smiles(smiles: str) Tuple[str | None, Mol | None]

Convert a SMILES string to a standardized RDKit Mol object and return both standardized SMILES and Mol.

Parameters:

smiles (str) – The SMILES string to be standardized.

Returns:

Tuple containing the standardized SMILES string and Mol object, or (None, None) if unsuccessful.

Return type:

Tuple[Optional[str], Optional[Chem.Mol]]

proqsar.Data.Standardizer.assign_stereochemistry(mol: Mol, cleanIt: bool = True, force: bool = True) None

Assign stereochemistry to a molecule using RDKit’s AssignStereochemistry.

Parameters:
  • mol (Chem.Mol) – The RDKit molecule object.

  • cleanIt (bool, optional) – Whether to clean the molecule before assignment. Default is True.

  • force (bool, optional) – Whether to force stereochemistry assignment. Default is True.

Returns:

None

Return type:

None

proqsar.Data.Standardizer.canonicalize_tautomer(mol: Mol) Mol

Canonicalize the tautomer of a molecule using RDKit’s TautomerCanonicalizer.

Parameters:

mol (Chem.Mol) – The RDKit molecule object.

Returns:

The molecule object with canonicalized tautomer.

Return type:

Chem.Mol

>>> from rdkit import Chem
>>> from proqsar.Data.Standardizer import canonicalize_tautomer
>>> mol = Chem.MolFromSmiles("O=C1NC=CC1=O")
>>> canonicalized = canonicalize_tautomer(mol)

proqsar.Data.Standardizer.fragments_remover(mol: Mol) Mol | None

Remove small fragments from a molecule, keeping only the largest one.

Parameters:

mol (Chem.Mol) – The RDKit molecule object.

Returns:

The molecule object with only the largest fragment kept, or None if fragment removal fails.

Return type:

Optional[Chem.Mol]

>>> from rdkit import Chem
>>> from proqsar.Data.Standardizer import fragments_remover
>>> mol = Chem.MolFromSmiles("CCC.CCCO")
>>> largest = fragments_remover(mol)

proqsar.Data.Standardizer.normalize_molecule(mol: Mol) Mol

Normalize a molecule using RDKit’s Normalizer to correct functional groups and recombine charges.

Parameters:

mol (Chem.Mol) – The RDKit molecule object to be normalized.

Returns:

The normalized RDKit molecule object.

Return type:

Chem.Mol

>>> from rdkit import Chem
>>> from proqsar.Data.Standardizer import normalize_molecule
>>> mol = Chem.MolFromSmiles("CC(=O)O")
>>> normalized = normalize_molecule(mol)

proqsar.Data.Standardizer.reionize_charges(mol: Mol) Mol

Adjust a molecule to its most likely ionic state using RDKit’s Reionizer.

Parameters:

mol (Chem.Mol) – The RDKit molecule object.

Returns:

The molecule object with reionized charges.

Return type:

Chem.Mol

>>> from rdkit import Chem
>>> from proqsar.Data.Standardizer import reionize_charges
>>> mol = Chem.MolFromSmiles("CC[NH3+]")
>>> reionized = reionize_charges(mol)

proqsar.Data.Standardizer.remove_hydrogens_and_sanitize(mol: Mol) Mol | None

Remove explicit hydrogens and sanitize a molecule.

Parameters:

mol (Chem.Mol) – The RDKit molecule object.

Returns:

The molecule object with explicit hydrogens removed and sanitized, or None if sanitization fails.

Return type:

Optional[Chem.Mol]

>>> from rdkit import Chem
>>> from proqsar.Data.Standardizer import remove_hydrogens_and_sanitize
>>> mol = Chem.MolFromSmiles("CCO")
>>> clean_mol = remove_hydrogens_and_sanitize(mol)

proqsar.Data.Standardizer.salts_remover(mol: Mol) Mol

Remove salt fragments from a molecule using RDKit’s SaltRemover.

Parameters:

mol (Chem.Mol) – The RDKit molecule object.

Returns:

The molecule object with salts removed.

Return type:

Chem.Mol

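>>> from rdkit import Chem
>>> from proqsar.Data.Standardizer import salts_remover
>>> mol = Chem.MolFromSmiles("CCO.[Na+]")  # Na must be bracketed to be valid SMILES
>>> desalted = salts_remover(mol)

proqsar.Data.Standardizer.uncharge_molecule(mol: Mol) Mol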

Neutralize a molecule by removing charges using RDKit’s Uncharger.

Parameters:

mol (Chem.Mol) – The RDKit molecule object.

Returns:

The neutralized molecule object.

Return type:

Chem.Mol

>>> from rdkit import Chem
>>> from proqsar.Data.Standardizer import uncharge_molecule
>>> mol = Chem.MolFromSmiles("CC[NH3+].[Cl-]")
>>> uncharged = uncharge_molecule(mol)

Featurizer

Splitter

class proqsar.Data.Splitter.data_splitter.Splitter(activity_col: str = 'activity', smiles_col: str = 'SMILES', mol_col: str = 'mol', option: str = 'random', test_size: float = 0.2, n_splits: int = 5, cutoff: float = 0.35, random_state: int = 42, save_dir: str | None = 'Project/Splitter', data_name: str | None = None, deactivate: bool = False)

Bases: BaseEstimator

Unified interface for dataset partitioning into train/test subsets.

This class provides a common interface to perform different dataset splitting strategies, such as random splitting, stratified random splitting, scaffold-based splitting, and scaffold-based stratified splitting. It also handles optional saving of train/test splits to disk.

Parameters:
  • activity_col (str) – Name of the column representing the activity or target label. Default is "activity".

  • smiles_col (str) – Name of the column containing SMILES strings for molecular data. Default is "SMILES".

  • mol_col (str) – Name of the column containing RDKit Mol objects (if available). Default is "mol".

  • option (str) – Splitting method, one of "random", "stratified_random", "scaffold", "random_scaffold", "stratified_scaffold", "butina". Default is "random".

  • test_size (float) – Proportion of the dataset to include in the test split. Default is 0.2.

  • n_splits (int) – Number of folds for stratified splitting (used only when option is "stratified_scaffold"). Default is 5.

  • cutoff (float) – Tanimoto distance cutoff used in "butina" splitting. Lower values produce smaller, finer clusters (stricter similarity), while higher values produce larger, coarser clusters. Only used when option="butina". Default is 0.35.

  • random_state (int) – Random seed used by the random number generator. Default is 42.

  • save_dir (Optional[str]) – Directory where train/test CSV files are saved. If None, no files are written. Default is "Project/Splitter".

  • data_name (Optional[str]) – Optional name suffix for saved train/test files. Default is None.

  • deactivate (bool) – If True, disables splitting and returns the full dataset as training set. Default is False.

Example

import pandas as pd
from proqsar.Data.Splitter.data_splitter import Splitter

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "SMILES": ["CCO", "CCC", "CCN", "CCCl", "CCBr"],
    "activity": [1.2, 3.4, 2.1, 0.5, 4.7]
})

splitter = Splitter(option="random", test_size=0.4, random_state=0)
train, test = splitter.fit(df)
print(train.shape, test.shape)  # (3, 2) (2, 2)

fit(data: DataFrame | List[Dict]) Tuple[DataFrame, DataFrame | None]

Split the dataset into training and testing sets.

Parameters:

data (pd.DataFrame) – Input dataset containing at least the activity column and SMILES column.

Returns:

A tuple (train_df, test_df), where:
  • train_df is the training dataset (with SMILES and Mol columns dropped).

  • test_df is the testing dataset (with SMILES and Mol columns dropped), or None if deactivate=True.

Return type:

Tuple[pd.DataFrame, Optional[pd.DataFrame]]

Raises:
  • ValueError – If an invalid splitting option is provided.

  • Exception – If an unexpected error occurs during splitting.
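The idea behind the group-aware options ("scaffold", "butina") can be sketched in plain Python: rows that share a group key are kept together, so no group appears in both train and test. This is only an illustration under stated assumptions, not ProQSAR's implementation. The function name group_split and the precomputed group keys are hypothetical, and the real splitter derives its groups from the molecules themselves.

```python
import random
from collections import defaultdict

def group_split(groups, test_size=0.2, random_state=42):
    """Split row indices so that no group is shared between train and test.

    `groups` holds one hypothetical, precomputed group key per row (for
    example a scaffold identifier). Whole groups are moved into the test
    set as long as it stays within `test_size` of the rows.
    """
    by_group = defaultdict(list)
    for idx, key in enumerate(groups):
        by_group[key].append(idx)

    keys = sorted(by_group)  # deterministic order before shuffling
    random.Random(random_state).shuffle(keys)

    n_test_target = round(test_size * len(groups))
    train, test = [], []
    for key in keys:
        members = by_group[key]
        # Fill the test set first, but never overshoot the target size.
        if len(test) + len(members) <= n_test_target:
            test.extend(members)
        else:
            train.extend(members)
    return sorted(train), sorted(test)

groups = ["A", "A", "B", "C", "C", "C", "D", "E", "E", "F"]
train_idx, test_idx = group_split(groups, test_size=0.3, random_state=0)
```

Because groups are assigned whole, the test set may end up slightly smaller than test_size requests; that trade-off is what prevents closely related molecules from leaking across the split.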

set_fit_request(*, data: bool | None | str = '$UNCHANGED$') Splitter

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters

data : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for data parameter in fit.

Returns

self : object

The updated object.

Preprocessor

Data Cleaning

class proqsar.Preprocessor.Clean.duplicate_handler.DuplicateHandler(activity_col: str | None = None, id_col: str | None = None, cols: bool = True, rows: bool = True, keep: str = 'mean', random_state: int | None = 42, save_method: bool = False, save_dir: str = 'Project/DuplicateHandler', save_trans_data: bool = False, trans_data_name: str = 'trans_data', deactivate: bool = False)

Bases: BaseEstimator, TransformerMixin

A preprocessing transformer to detect and remove duplicate columns and rows in a pandas DataFrame.

  • Duplicate columns are removed (if cols=True) using exact column equality.

  • Duplicate rows are consolidated (if rows=True) based on all feature columns (i.e., all columns except id_col, activity_col, and any removed duplicate columns). The consolidation strategy for the activity column is controlled by keep:

    • ‘first’ : keep the first occurrence as-is

    • ‘last’ : keep the last occurrence as-is

    • ‘random’ : keep a random occurrence (requires random_state for determinism)

    • ‘min’ : keep the row with minimum activity

    • ‘max’ : keep the row with maximum activity

    • ‘mean’ : collapse duplicates and set activity to the mean

    • ‘median’ : collapse duplicates and set activity to the median

    For ‘mean’ / ‘median’, the first row of the group is retained and its activity value is replaced by the aggregated statistic.

Supports saving the fitted handler and transformed data for reproducibility.
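The row-consolidation rules above can be illustrated with a small pure-Python sketch. It is not the DuplicateHandler implementation: collapse_duplicates is a hypothetical helper that works on a list of dicts rather than a DataFrame, and the 'random' option is left out for brevity.

```python
from statistics import mean, median

def collapse_duplicates(rows, feature_cols, activity_col, keep="mean"):
    """Collapse rows that have identical values in `feature_cols`.

    For "mean" and "median" the first row of each duplicate group is
    retained and its activity is replaced by the aggregated statistic.
    """
    aggregate = {"mean": mean, "median": median, "min": min, "max": max}
    groups = {}  # dicts preserve insertion order, so first occurrences stay first
    for row in rows:
        key = tuple(row[c] for c in feature_cols)
        groups.setdefault(key, []).append(row)

    result = []
    for members in groups.values():
        activities = [m[activity_col] for m in members]
        if keep == "first":
            kept = dict(members[0])
        elif keep == "last":
            kept = dict(members[-1])
        elif keep in ("min", "max"):
            target = aggregate[keep](activities)
            kept = dict(next(m for m in members if m[activity_col] == target))
        elif keep in ("mean", "median"):
            kept = dict(members[0])
            kept[activity_col] = aggregate[keep](activities)
        else:
            raise ValueError(f"unsupported keep option: {keep!r}")
        result.append(kept)
    return result

rows = [
    {"id": 1, "f1": 0, "f2": 1, "activity": 2.0},
    {"id": 2, "f1": 0, "f2": 1, "activity": 4.0},
    {"id": 3, "f1": 1, "f2": 1, "activity": 5.0},
]
deduped = collapse_duplicates(rows, ["f1", "f2"], "activity", keep="mean")
```

Here rows 1 and 2 share the same feature values, so they collapse into a single row that keeps id 1 and carries the mean activity of the pair.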

fit(data: DataFrame, y=None) DuplicateHandler

Fit the handler by identifying duplicate columns.

Parameters:
  • data (pandas.DataFrame) – Input DataFrame to inspect for duplicate columns.

  • y (Optional[pandas.Series]) – Ignored. Present for sklearn compatibility.

Returns:

The fitted DuplicateHandler instance.

Return type:

DuplicateHandler

Raises:

Exception – If an unexpected error occurs during fitting.

fit_transform(data: DataFrame, y=None) DataFrame

Fit the handler and then transform the data.

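Parameters:
  • data (pandas.DataFrame) – Input DataFrame to fit and transform.

  • y (Optional[pandas.Series]) – Ignored. Present for sklearn compatibility.

Returns: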

Transformed DataFrame with duplicates removed.

Return type:

pandas.DataFrame

set_fit_request(*, data: bool | None | str = '$UNCHANGED$') DuplicateHandler

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters

data : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for data parameter in fit.

Returns

self : object

The updated object.

set_transform_request(*, data: bool | None | str = '$UNCHANGED$') DuplicateHandler

Request metadata passed to the transform method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters

data : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for data parameter in transform.

Returns

self : object

The updated object.

transform(data: DataFrame) DataFrame

Transform the DataFrame by removing duplicate rows and columns.

Parameters:

data (pandas.DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame with duplicates removed.

Return type:

pandas.DataFrame

Raises:
  • ValueError – If a required column is missing.

  • Exception – For any unexpected error during transformation.

class proqsar.Preprocessor.Clean.low_variance_handler.LowVarianceHandler(activity_col: str | None = None, id_col: str | None = None, var_thresh: float = 0.05, save_method: bool = False, visualize: bool = False, save_image: bool = False, image_name: str = 'variance_analysis', save_dir: str = 'Project/LowVarianceHandler', save_trans_data: bool = False, trans_data_name: str = 'trans_data', deactivate: bool = False)

Bases: BaseEstimator, TransformerMixin

Preprocessing transformer that removes low-variance features.

Features with variance below a specified threshold are dropped. Supports visualization, saving fitted objects, and saving transformed data.

fit(data: DataFrame, y=None) LowVarianceHandler

Fit the handler by determining which features exceed the variance threshold.

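Parameters:
  • data (pandas.DataFrame) – Input DataFrame used to determine which features to retain.

  • y (optional) – Not described in the source; present in the signature with default None.

Returns: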

The fitted LowVarianceHandler instance.

Return type:

LowVarianceHandler

fit_transform(data: DataFrame, y=None) DataFrame

Fit the handler and transform the data.

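Parameters:
  • data (pandas.DataFrame) – Input DataFrame to fit and transform.

  • y (optional) – Not described in the source; present in the signature with default None.

Returns: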

Transformed DataFrame with selected features retained.

Return type:

pandas.DataFrame

static select_features_by_variance(data: DataFrame, activity_col: str | None = None, id_col: str | None = None, var_thresh: float = 0.05) list

Select features that pass the variance threshold.

Parameters:
  • data (pandas.DataFrame) – Input DataFrame.

  • activity_col (Optional[str]) – Activity column to exclude from selection.

  • id_col (Optional[str]) – ID column to exclude from selection.

  • var_thresh (float) – Minimum variance required to retain a feature.

Returns:

List of selected feature names.

Return type:

list

Raises:

Exception – If variance selection fails.
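A minimal sketch of variance-based selection, using only the standard library. It assumes population variance and a "keep if variance >= threshold" rule; the source does not say which variance estimator or comparison ProQSAR uses, and select_by_variance is a hypothetical stand-in operating on a dict of columns rather than a DataFrame.

```python
from statistics import pvariance

def select_by_variance(columns, var_thresh=0.05, exclude=()):
    """Return the names of columns whose variance is at least `var_thresh`.

    Columns listed in `exclude` (for example the ID and activity columns)
    are skipped entirely.
    """
    selected = []
    for name, values in columns.items():
        if name in exclude:
            continue
        if pvariance(values) >= var_thresh:  # assumption: population variance
            selected.append(name)
    return selected

columns = {
    "id": [1, 2, 3, 4],
    "constant": [1.0, 1.0, 1.0, 1.0],
    "narrow": [0.50, 0.51, 0.49, 0.50],
    "wide": [0.0, 1.0, 2.0, 3.0],
    "activity": [5.1, 6.3, 4.8, 7.0],
}
kept = select_by_variance(columns, var_thresh=0.05, exclude=("id", "activity"))
```

Only "wide" survives here: the constant column has zero variance and "narrow" varies far less than the 0.05 threshold.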

set_fit_request(*, data: bool | None | str = '$UNCHANGED$') LowVarianceHandler

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters

data : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for data parameter in fit.

Returns

self : object

The updated object.

set_transform_request(*, data: bool | None | str = '$UNCHANGED$') LowVarianceHandler

Request metadata passed to the transform method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters

data : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for data parameter in transform.

Returns

self : object

The updated object.

transform(data: DataFrame) DataFrame

Transform the data by keeping only selected features.

Parameters:

data (pandas.DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame with only retained features.

Return type:

pandas.DataFrame

Raises:
  • NotFittedError – If called before fit.

  • Exception – For unexpected errors during transformation.

static variance_threshold_analysis(data: DataFrame, activity_col: str | None = None, id_col: str | None = None, set_style: str = 'whitegrid', save_image: bool = False, image_name: str = 'variance_analysis', save_dir: str = 'Project/VarianceHandler') None

Perform variance threshold analysis on non-binary features and plot retained feature counts as threshold increases.

Parameters:
  • data (pandas.DataFrame) – Input DataFrame.

  • activity_col (Optional[str]) – Activity column to exclude from analysis.

  • id_col (Optional[str]) – ID column to exclude from analysis.

  • set_style (str) – Seaborn plot style (default “whitegrid”).

  • save_image (bool) – Whether to save the plot as an image.

  • image_name (str) – Base filename for saved image.

  • save_dir (str) – Directory to save plot if save_image=True.

Returns:

None

Return type:

None

Raises:

Exception – If variance analysis fails.

class proqsar.Preprocessor.Clean.missing_handler.MissingHandler(activity_col: str | None = None, id_col: str | None = None, missing_thresh: float = 40.0, imputation_strategy: str = 'mean', n_neighbors: int = 5, save_method: bool = False, save_dir: str | None = 'Project/MissingHandler', save_trans_data: bool = False, trans_data_name: str = 'trans_data', deactivate: bool = False)

Bases: BaseEstimator, TransformerMixin

Handle missing values by:
  • dropping columns with too many missing values,

  • imputing binary and non-binary columns separately,

  • supporting multiple imputation strategies.

Supports saving fitted imputers and transformed data for reproducibility.

static calculate_missing_percent(data: DataFrame) DataFrame

Compute percentage of missing values per column.

Parameters:

data (pandas.DataFrame) – Input DataFrame.

Returns:

DataFrame with columns ["ColumnName", "MissingPercent"].

Return type:

pandas.DataFrame
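The missing-percentage calculation and the column-dropping rule it feeds can be sketched without pandas. This is an illustration only: missing_percent is a hypothetical helper over a dict of columns, and treating "strictly greater than missing_thresh" as the drop condition is an assumption, since the source does not state whether the comparison is strict.

```python
import math

def missing_percent(columns):
    """Return {column name: percentage of missing entries}.

    Both None and float NaN are counted as missing.
    """
    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    return {
        name: 100.0 * sum(is_missing(v) for v in values) / len(values)
        for name, values in columns.items()
    }

columns = {
    "f1": [1.0, None, 3.0, 4.0, 5.0],
    "f2": [None, None, None, 1.0, 2.0],
    "f3": [0, 1, 1, 0, 1],
}
percent = missing_percent(columns)

missing_thresh = 40.0  # same value as the documented default
# Assumption: columns strictly above the threshold are dropped.
dropped = [name for name, p in percent.items() if p > missing_thresh]
```

With these inputs "f2" is 60% missing and is dropped, while "f1" (20% missing) stays and would be imputed.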

fit(data: DataFrame, y=None) MissingHandler

Fit imputers to the dataset.

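Parameters:
  • data (pandas.DataFrame) – Input DataFrame used to fit the imputers.

  • y (optional) – Not described in the source; present in the signature with default None.

Returns: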

Fitted handler.

Return type:

MissingHandler

Raises:

Exception – For unexpected fitting errors.

fit_transform(data: DataFrame, y=None) DataFrame

Fit imputers and transform the dataset in one step.

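Parameters:
  • data (pandas.DataFrame) – Input DataFrame to fit and transform.

  • y (optional) – Not described in the source; present in the signature with default None.

Returns: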

Transformed DataFrame with imputed values.

Return type:

pandas.DataFrame

set_fit_request(*, data: bool | None | str = '$UNCHANGED$') MissingHandler

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters

data : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for data parameter in fit.

Returns

self : object

The updated object.

set_transform_request(*, data: bool | None | str = '$UNCHANGED$') MissingHandler

Request metadata passed to the transform method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters

data : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for data parameter in transform.

Returns

self : object

The updated object.

transform(data: DataFrame) DataFrame

Impute missing values using fitted imputers.

Parameters:

data (pandas.DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame with missing values imputed.

Return type:

pandas.DataFrame

Raises:
  • NotFittedError – If called before fit.

  • Exception – For unexpected transformation errors.

class proqsar.Preprocessor.Clean.rescaler.Rescaler(activity_col: str | None = None, id_col: str | None = None, select_method: str = 'MinMaxScaler', save_method: bool = False, save_dir: str | None = 'Project/Rescaler', save_trans_data: bool = False, trans_data_name: str = 'trans_data', deactivate: bool = False)

Bases: BaseEstimator, TransformerMixin

Rescale (normalize or standardize) numerical columns in a dataset.

This class provides scaling methods such as Min-Max scaling, Standard scaling, and Robust scaling. It excludes identifier and activity columns, automatically detects non-binary columns for scaling, and optionally saves both the fitted scaler and transformed data.

Parameters:
  • activity_col (Optional[str]) – Column name containing activity labels to exclude from scaling.

  • id_col (Optional[str]) – Column name containing unique identifiers to exclude from scaling.

  • select_method (str) – Scaling method to use. Options are "MinMaxScaler", "StandardScaler", "RobustScaler", or "None". Default is "MinMaxScaler".

  • save_method (bool) – Whether to save the fitted rescaler model after fitting. Default is False.

  • save_dir (Optional[str]) – Directory where the rescaler model and transformed data will be saved. Default is "Project/Rescaler".

  • save_trans_data (bool) – Whether to save the transformed data as a CSV file. Default is False.

  • trans_data_name (str) – Base name for the transformed data file. Default is "trans_data".

  • deactivate (bool) – If True, disables scaling and returns unmodified data. Default is False.

Example

import pandas as pd
from proqsar.Preprocessor.Clean.rescaler import Rescaler

df = pd.DataFrame({
    "id": [1, 2, 3],
    "feature1": [0.1, 0.5, 0.9],
    "feature2": [10, 20, 30],
    "activity": [1.2, 3.4, 2.1]
})

rescaler = Rescaler(activity_col="activity", id_col="id", select_method="StandardScaler")
df_scaled = rescaler.fit_transform(df)

print(df_scaled)

fit(data: DataFrame, y=None) Rescaler

Fit the rescaler on the dataset.

Non-binary columns (not exclusively 0/1) are detected and used for fitting the scaler.

Parameters:
  • data (pd.DataFrame) – Dataset to fit on.

  • y (any, optional) – Ignored, included for compatibility with scikit-learn pipelines.

Returns:

Fitted Rescaler object.

Return type:

Rescaler

Raises:

Exception – If an error occurs during fitting.
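To make the fit/transform split and the binary-column detection concrete, here is a small pure-Python sketch of min-max scaling. It is not the Rescaler implementation (which, per the select_method names, presumably wraps scikit-learn scalers); is_binary, min_max_fit and min_max_transform are hypothetical helpers over a dict of columns.

```python
def is_binary(values):
    """True when a column contains only 0/1 values."""
    return set(values) <= {0, 1}

def min_max_fit(columns, exclude=()):
    """Record (min, max) for every non-binary, non-excluded column."""
    return {
        name: (min(values), max(values))
        for name, values in columns.items()
        if name not in exclude and not is_binary(values)
    }

def min_max_transform(columns, params):
    """Scale fitted columns with the stored min/max; leave others untouched."""
    out = {}
    for name, values in columns.items():
        if name in params:
            lo, hi = params[name]
            span = (hi - lo) or 1.0  # avoid division by zero for constant columns
            out[name] = [(v - lo) / span for v in values]
        else:
            out[name] = list(values)
    return out

train = {
    "id": [1, 2, 3],
    "flag": [0, 1, 1],
    "feature2": [10, 20, 30],
    "activity": [1.2, 3.4, 2.1],
}
params = min_max_fit(train, exclude=("id", "activity"))
scaled = min_max_transform(train, params)
```

Only "feature2" is fitted and scaled; the binary "flag" column and the excluded "id" and "activity" columns pass through unchanged. Keeping the fitted minimum and maximum separate from the transform step is what allows the same parameters to be reused on a test set.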

fit_transform(data: DataFrame, y=None) DataFrame

Fit to data, then transform it.

Parameters:
  • data (pd.DataFrame) – Dataset to fit and transform.

  • y (any, optional) – Ignored, included for compatibility with scikit-learn pipelines.

Returns:

Transformed dataset.

Return type:

pd.DataFrame

set_fit_request(*, data: bool | None | str = '$UNCHANGED$') Rescaler

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters

data : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for data parameter in fit.

Returns

self : object

The updated object.

set_transform_request(*, data: bool | None | str = '$UNCHANGED$') Rescaler

Request metadata passed to the transform method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters

data : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for data parameter in transform.

Returns

self : object

The updated object.

setting(**kwargs)

Update settings of the Rescaler object.

Parameters:

kwargs (dict) – Keyword arguments mapping attribute names to new values.

Returns:

Updated Rescaler object.

Return type:

Rescaler

Raises:

KeyError – If a provided key is not a valid attribute of Rescaler.
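The update-and-validate behaviour described for setting can be sketched as follows. This is a minimal illustration, not the Rescaler source: the class TinyConfig, its attributes, and their default values are hypothetical stand-ins.

```python
class TinyConfig:
    """Hypothetical stand-in for an object exposing setting(**kwargs)."""

    def __init__(self, select_method="MinMaxScaler", deactivate=False):
        self.select_method = select_method
        self.deactivate = deactivate

    def setting(self, **kwargs):
        # Reject unknown names before changing anything, so a failed call
        # leaves the object untouched.
        for key in kwargs:
            if key not in self.__dict__:
                raise KeyError(
                    f"'{key}' is not a valid attribute of {type(self).__name__}."
                )
        for key, value in kwargs.items():
            setattr(self, key, value)
        return self  # returning self allows call chaining


cfg = TinyConfig().setting(deactivate=True)
```

Validating every key before assigning any of them is a design choice of this sketch; whether the library validates up front or key by key is not stated here.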

transform(data: DataFrame) DataFrame

Transform the dataset using the fitted rescaler.

Parameters:

data (pd.DataFrame) – Dataset to transform.

Returns:

Transformed dataset.

Return type:

pd.DataFrame

Raises:
  • NotFittedError – If the rescaler has not been fitted yet.

  • Exception – If an error occurs during transformation.

Outlier handling

class proqsar.Preprocessor.Outlier.univariate_outliers.IQRHandler

Bases: object

Handler that removes rows containing univariate outliers based on IQR thresholds.

Typical use:

handler = IQRHandler()
handler.fit(df)
df_clean = handler.transform(df)

Variables:

iqr_thresholds (Optional[Dict[str, Dict[str, float]]]) – dictionary of thresholds created during fit.
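The fit/transform split of an IQR-based row filter can be sketched with plain pandas. This is not the IQRHandler implementation: the helper names are hypothetical, and the 1.5 multiplier is the conventional Tukey fence, assumed here because the reference does not state the value used.

```python
import pandas as pd


def iqr_thresholds(df: pd.DataFrame, k: float = 1.5) -> dict:
    """Per-column {'low', 'high'} fences; k=1.5 is an assumed multiplier."""
    thresholds = {}
    for col in df.columns:
        q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
        iqr = q3 - q1
        thresholds[col] = {"low": q1 - k * iqr, "high": q3 + k * iqr}
    return thresholds


def drop_outlier_rows(df: pd.DataFrame, thresholds: dict) -> pd.DataFrame:
    """Keep only rows where every thresholded column lies within its fences."""
    keep = pd.Series(True, index=df.index)
    for col, t in thresholds.items():
        keep &= df[col].between(t["low"], t["high"])
    return df[keep]


train = pd.DataFrame(
    {"x": [1.0, 2.0, 3.0, 4.0, 100.0], "y": [10.0, 11.0, 12.0, 13.0, 14.0]}
)
fences = iqr_thresholds(train)            # learned once, like fit()
clean = drop_outlier_rows(train, fences)  # reusable on any frame, like transform()
```

Because the thresholds are stored separately from the data, the same fences can be applied to a held-out set without recomputing quartiles on it.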

fit(data: DataFrame) IQRHandler

Compute IQR thresholds from the provided data.

Parameters:

data (pd.DataFrame) – DataFrame used to compute thresholds.

Returns:

self (fitted handler).

Return type:

IQRHandler

fit_transform(data: DataFrame) DataFrame

Fit the handler and immediately transform the same data.

Parameters:

data (pd.DataFrame) – DataFrame to fit and transform.

Returns:

Filtered DataFrame.

Return type:

pd.DataFrame

transform(data: DataFrame) DataFrame

Remove rows that contain values outside the precomputed IQR thresholds.

Parameters:

data (pd.DataFrame) – DataFrame to filter.

Returns:

Filtered DataFrame with outlier rows removed.

Return type:

pd.DataFrame

Raises:

NotFittedError – If fit has not been called.

class proqsar.Preprocessor.Outlier.univariate_outliers.ImputationHandler(missing_thresh: float = 40.0, imputation_strategy: str = 'mean', n_neighbors: int = 5)

Bases: object

Handler that marks univariate outliers as NaN (based on IQR) and imputes them using MissingHandler.

Typical use:

ih = ImputationHandler(missing_thresh=40.0, imputation_strategy='mean')
ih.fit(df)
df_imputed = ih.transform(df)

Parameters:
  • missing_thresh (float) – Max allowed percent-missing per column (forwarded to MissingHandler).

  • imputation_strategy (str) – Imputation strategy passed to MissingHandler ('mean', 'median', 'mode', 'knn', 'mice').

  • n_neighbors (int) – neighbors used for KNN imputation when selected.

Variables:
  • iqr_thresholds (Optional[Dict[str, Dict[str, float]]]) – thresholds used to mark outliers as NaN.

  • imputation_handler (Optional[MissingHandler]) – fitted MissingHandler instance.
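A minimal sketch of the mask-then-impute idea, using pandas mean imputation in place of MissingHandler. The helper name and the fence values are assumptions made for illustration; the library's own imputer may behave differently.

```python
import numpy as np
import pandas as pd


def mask_outliers(df: pd.DataFrame, thresholds: dict) -> pd.DataFrame:
    """Return a copy where values outside [low, high] are replaced by NaN."""
    out = df.copy()
    for col, t in thresholds.items():
        outside = (out[col] < t["low"]) | (out[col] > t["high"])
        out.loc[outside, col] = np.nan
    return out


df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 100.0]})
# Assumed fences (Q1 - 1.5 * IQR, Q3 + 1.5 * IQR) for this toy column.
thresholds = {"x": {"low": -1.0, "high": 7.0}}

masked = mask_outliers(df, thresholds)
fill_values = masked.mean()           # "fit": means of the NaN-marked data
imputed = masked.fillna(fill_values)  # "transform": fill the marked cells
```

Computing the fill value after masking matters: the mean of the masked column is 2.5, whereas the raw mean would be pulled toward the outlier.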

fit(data: DataFrame) ImputationHandler

Compute IQR thresholds and fit a MissingHandler on the NaN-marked data.

Parameters:

data (pd.DataFrame) – DataFrame used to compute thresholds and to fit imputer.

Returns:

self (fitted ImputationHandler).

Return type:

ImputationHandler

fit_transform(data: DataFrame) DataFrame

Fit and impute in one step.

Parameters:

data (pd.DataFrame) – DataFrame to fit & impute.

Returns:

Imputed DataFrame.

Return type:

pd.DataFrame

transform(data: DataFrame) DataFrame

Replace outliers with NaN according to fitted thresholds and impute them.

Parameters:

data (pd.DataFrame) – DataFrame to impute.

Returns:

Imputed DataFrame.

Return type:

pd.DataFrame

Raises:

NotFittedError – If fit has not been called.

class proqsar.Preprocessor.Outlier.univariate_outliers.UnivariateOutliersHandler(activity_col: str | None = None, id_col: str | None = None, select_method: str = 'uniform', imputation_strategy: str = 'mean', missing_thresh: float = 40.0, n_neighbors: int = 5, save_method: bool = False, save_dir: str | None = 'Project/OutlierHandler', save_trans_data: bool = False, trans_data_name: str = 'trans_data', deactivate: bool = False)

Bases: BaseEstimator, TransformerMixin

High-level univariate outlier handler.

This class detects features with univariate outliers (via _feature_quality) and applies one of several handling strategies only to those features:

  • 'iqr' : remove rows outside IQR thresholds (IQRHandler)

  • 'winsorization' : cap values at thresholds (WinsorHandler)

  • 'imputation' : set outliers to NaN and impute (ImputationHandler)

  • 'power' : PowerTransformer()

  • 'normal' : QuantileTransformer(output_distribution='normal')

  • 'uniform' : QuantileTransformer(output_distribution='uniform')

Typical usage:

uoh = UnivariateOutliersHandler(select_method='iqr', id_col='id')
uoh.fit(df)
df_out = uoh.transform(df)
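The three scikit-learn based strategies in the list above can be tried directly on a flagged column. The sketch below is an illustration under assumptions: the bad list stands in for the detection step, the toy data is synthetic, and n_quantiles=100 is chosen only to stay below the sample count.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

rng = np.random.default_rng(0)
df = pd.DataFrame(
    {"skewed": rng.lognormal(size=200), "other": rng.normal(size=200)}
)

# Method keys mirror the list above; n_quantiles is set explicitly because
# it should not exceed the number of samples.
transformers = {
    "power": PowerTransformer(),
    "normal": QuantileTransformer(output_distribution="normal", n_quantiles=100),
    "uniform": QuantileTransformer(output_distribution="uniform", n_quantiles=100),
}

bad = ["skewed"]  # assumed result of the outlier-detection step
results = {}
for name, transformer in transformers.items():
    out = df.copy()
    out[bad] = transformer.fit_transform(df[bad])  # only flagged columns change
    results[name] = out
```

Unlike 'iqr', none of these three strategies removes rows; they reshape the distribution of the flagged columns instead.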

Parameters:
  • activity_col (Optional[str]) – Optional column name for activity/target to exclude from detection.

  • id_col (Optional[str]) – Optional column name for identifiers to exclude from detection.

  • select_method (str) – Method key: one of 'iqr', 'winsorization', 'imputation', 'power', 'normal', or 'uniform'. Default is 'uniform'.

  • imputation_strategy (str) – Strategy forwarded to ImputationHandler when used.

  • missing_thresh (float) – Missing percent threshold forwarded to ImputationHandler.

  • n_neighbors (int) – KNN neighbors forwarded to ImputationHandler when used.

  • save_method (bool) – If True, saves the fitted handler as a pickle in save_dir.

  • save_dir (Optional[str]) – Directory used for saving pickles/CSVs.

  • save_trans_data (bool) – If True, transformed data is saved to CSV.

  • trans_data_name (str) – Filename base for saving transformed CSV.

  • deactivate (bool) – If True, fit/transform become no-ops and input is returned unchanged.

static compare_univariate_methods(data1: DataFrame, data2: DataFrame | None = None, data1_name: str = 'data1', data2_name: str = 'data2', activity_col: str | None = None, id_col: str | None = None, methods_to_compare: List[str] = None, save_dir: str | None = 'Project/OutlierHandler') DataFrame

Compare a set of univariate outlier handling methods by applying each to data1 and (optionally) data2 and summarizing how many rows remain / are removed.

Parameters:
  • data1 (pd.DataFrame) – Primary DataFrame to evaluate methods on.

  • data2 (Optional[pd.DataFrame]) – Optional secondary DataFrame to evaluate with the same fitted handlers.

  • data1_name (str) – Label used for dataset1 in the output table.

  • data2_name (str) – Label used for dataset2 in the output table.

  • activity_col (Optional[str]) – Optional activity/target column to exclude from detection.

  • id_col (Optional[str]) – Optional ID column to exclude from detection.

  • methods_to_compare (List[str]) – List of method keys to compare. Defaults to all supported methods.

  • save_dir (Optional[str]) – If provided, the comparison table CSV will be saved here.

Returns:

DataFrame summarizing for each method and dataset the row counts before/after handling.

Return type:

pd.DataFrame

Raises:

Exception – Propagates exceptions encountered during comparison.
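A comparison table of this kind can be sketched as below. The helper functions, the output column names, and the 1.5 multiplier are hypothetical; only two of the method keys are covered, and thresholds are fitted on data1 and reused for data2, as described for the optional secondary dataset.

```python
import pandas as pd


def fences(df, k=1.5):
    """Per-column lower and upper fences as two Series."""
    q1, q3 = df.quantile(0.25), df.quantile(0.75)
    return q1 - k * (q3 - q1), q3 + k * (q3 - q1)


def remove_rows(df, low, high):
    return df[((df >= low) & (df <= high)).all(axis=1)]


def cap_values(df, low, high):
    return df.apply(lambda col: col.clip(low[col.name], high[col.name]))


methods = {"iqr": remove_rows, "winsorization": cap_values}
data1 = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 100.0]})
data2 = pd.DataFrame({"x": [2.0, 3.0, -50.0]})

low, high = fences(data1)  # thresholds come from data1 only
rows = []
for method, apply_method in methods.items():
    for name, data in {"data1": data1, "data2": data2}.items():
        after = apply_method(data, low, high)
        rows.append(
            {
                "method": method,
                "dataset": name,
                "rows_before": len(data),
                "rows_after": len(after),
            }
        )
summary = pd.DataFrame(rows)
```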

fit(data: DataFrame, y=None) UnivariateOutliersHandler

Detect bad features and fit the selected outlier handling strategy.

Parameters:
  • data (pd.DataFrame) – Input DataFrame used to detect bad features and to fit the chosen handler.

  • y (Optional[pd.Series]) – Ignored; present for sklearn compatibility.

Returns:

self (fitted UnivariateOutliersHandler).

Return type:

UnivariateOutliersHandler

Raises:
  • ValueError – If an unsupported select_method is provided.

  • Exception – Propagates unexpected exceptions.

fit_transform(data: DataFrame, y=None) DataFrame

Fit the handler and immediately transform the provided data.

Parameters:
  • data (pd.DataFrame) – DataFrame to fit & transform.

  • y (Optional[pd.Series]) – Ignored; present for sklearn compatibility.

Returns:

Transformed DataFrame.

Return type:

pd.DataFrame

set_fit_request(*, data: bool | None | str = '$UNCHANGED$') UnivariateOutliersHandler

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters

data : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for data parameter in fit.

Returns

self : object

The updated object.

set_transform_request(*, data: bool | None | str = '$UNCHANGED$') UnivariateOutliersHandler

Request metadata passed to the transform method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters

data : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for data parameter in transform.

Returns

self : object

The updated object.

transform(data: DataFrame) DataFrame

Apply the fitted outlier handler to the detected bad features.

Only the columns flagged in self.bad are transformed; the rest of the DataFrame is preserved. If the chosen handler returns a numpy array, it is coerced into a DataFrame with the original column names.
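The subset-and-coerce step can be sketched as follows, assuming a scikit-learn transformer as the fitted handler. The bad list and the toy frame are illustrative assumptions, not taken from the library.

```python
import pandas as pd
from sklearn.preprocessing import PowerTransformer

df = pd.DataFrame(
    {
        "id": ["a", "b", "c", "d"],
        "f1": [1.0, 2.0, 3.0, 50.0],
        "f2": [0.1, 0.2, 0.3, 0.4],
    }
)
bad = ["f1"]  # assumed: columns flagged during fit

handler = PowerTransformer().fit(df[bad])
result = handler.transform(df[bad])  # scikit-learn returns a numpy array here

# Coerce the array back to a DataFrame that keeps the original labels.
if not isinstance(result, pd.DataFrame):
    result = pd.DataFrame(result, columns=bad, index=df.index)

out = df.copy()
out[bad] = result  # untouched columns keep their values and order
```

Passing index=df.index during coercion keeps row alignment intact when the input frame does not use a default integer index.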

Parameters:

data (pd.DataFrame) – DataFrame to transform.

Returns:

Transformed DataFrame with outlier handling applied.

Return type:

pd.DataFrame

Raises:
  • NotFittedError – If the handler has not been fitted.

  • Exception – Propagates unexpected exceptions during transformation.

class proqsar.Preprocessor.Outlier.univariate_outliers.WinsorHandler

Bases: object

Handler that applies Winsorization (capping) using IQR thresholds.

Typical use:

wh = WinsorHandler() wh.fit(df) df_capped = wh.transform(df)

Variables:

iqr_thresholds (Optional[Dict[str, Dict[str, float]]]) – dictionary of thresholds created during fit.
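Capping at IQR fences can be sketched with pandas alone. This is an illustration rather than the WinsorHandler code, and the 1.5 multiplier is an assumption.

```python
import pandas as pd

train = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 100.0]})

# "fit": per-column fences; the 1.5 multiplier is an assumption.
q1, q3 = train.quantile(0.25), train.quantile(0.75)
low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)

# "transform": cap each column at its own fences; no rows are dropped.
new = pd.DataFrame({"x": [-20.0, 3.0, 100.0]})
capped = new.apply(
    lambda col: col.clip(lower=low[col.name], upper=high[col.name])
)
```

In contrast to row removal, the output keeps every row of the input, which is useful when sample counts must stay aligned with other tables.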

fit(data: DataFrame) WinsorHandler

Compute and store IQR thresholds.

Parameters:

data (pd.DataFrame) – DataFrame used to compute thresholds.

Returns:

self (fitted handler).

Return type:

WinsorHandler

fit_transform(data: DataFrame) DataFrame

Fit thresholds then apply Winsorization.

Parameters:

data (pd.DataFrame) – DataFrame to fit & transform.

Returns:

Winsorized DataFrame.

Return type:

pd.DataFrame

transform(data: DataFrame) DataFrame

Cap values below/above thresholds to low/high respectively.

Parameters:

data (pd.DataFrame) – DataFrame to apply Winsorization to.

Returns:

Winsorized DataFrame.

Return type:

pd.DataFrame

Raises:

NotFittedError – If fit has not been called.

class proqsar.Preprocessor.Outlier.kbin_handler.KBinHandler(activity_col: str | None = None, id_col: str | None = None, n_bins: int = 3, encode: str = 'ordinal', strategy: str = 'quantile', save_method: bool = False, save_dir: str | None = 'Project/KBinHandler', save_trans_data: bool = False, trans_data_name: str = 'trans_data', deactivate: bool = False)

Bases: BaseEstimator, TransformerMixin

Discretize features identified as univariate outliers using sklearn.preprocessing.KBinsDiscretizer.

This handler detects “bad” features via _feature_quality(), fits a KBinsDiscretizer, and replaces bad features with binned columns (Kbin1, Kbin2, …).

Typical usage:

>>> kbin = KBinHandler(activity_col="activity", id_col="id", n_bins=3)
>>> kbin.fit(df)
>>> transformed = kbin.transform(df)

Parameters:
  • activity_col (Optional[str]) – Name of the activity/target column (if present).

  • id_col (Optional[str]) – Name of the identifier column (if present).

  • n_bins (int) – Number of bins to produce. Default is 3.

  • encode (str) – Encoding strategy {"ordinal", "onehot", "onehot-dense"}. Default is "ordinal".

  • strategy (str) – Binning strategy {"uniform", "quantile", "kmeans"}. Default is "quantile".

  • save_method (bool) – If True, save fitted handler as pickle.

  • save_dir (Optional[str]) – Directory to save pickled handler / CSV outputs. Default is “Project/KBinHandler”.

  • save_trans_data (bool) – If True, save transformed data to CSV.

  • trans_data_name (str) – Base filename for saving transformed CSV. Default is “trans_data”.

  • deactivate (bool) – If True, disable handler and return inputs unchanged.

Variables:
  • kbin (Optional[KBinsDiscretizer]) – Fitted KBinsDiscretizer after fit(), or None.

  • bad (list[str]) – Names of detected univariate outlier features.

  • transformed_data (pandas.DataFrame) – Stores the last transformed DataFrame.

fit(data: DataFrame, y=None) KBinHandler

Detect univariate outliers and fit KBinsDiscretizer on them.

Steps:
  1. Call _feature_quality() to detect “bad” features.

  2. If any, fit KBinsDiscretizer on those columns.

  3. Optionally save fitted handler as pickle.

Parameters:
  • data (pandas.DataFrame) – Input DataFrame to fit on.

  • y (Any) – Ignored, present for sklearn compatibility.

Returns:

Fitted handler (self).

Return type:

KBinHandler

Raises:

Exception – If fitting fails unexpectedly.

fit_transform(data: DataFrame, y=None) DataFrame

Fit KBinsDiscretizer on bad features and transform in one call.

Parameters:
  • data (pandas.DataFrame) – Input DataFrame to fit and transform.

  • y (Any) – Ignored, present for sklearn compatibility.

Returns:

Transformed DataFrame with discretized features.

Return type:

pandas.DataFrame

set_fit_request(*, data: bool | None | str = '$UNCHANGED$') KBinHandler

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters

data : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for data parameter in fit.

Returns

self : object

The updated object.

set_transform_request(*, data: bool | None | str = '$UNCHANGED$') KBinHandler

Request metadata passed to the transform method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters

data : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for data parameter in transform.

Returns

self : object

The updated object.

transform(data: DataFrame) DataFrame

Apply fitted KBinsDiscretizer to detected bad features.

  • If deactivated → return input unchanged.

  • If no bad features detected → return input unchanged.

  • Otherwise → replace bad features with new columns (“Kbin1”, “Kbin2”, …).
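The replace-with-binned-columns behaviour can be sketched with KBinsDiscretizer directly. The bad list stands in for the detection step, and appending the Kbin columns at the end of the frame is an assumption about column order, which the reference does not specify.

```python
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

df = pd.DataFrame(
    {
        "id": list(range(9)),
        "f1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 900.0],
        "f2": [0.5, 0.4, 0.3, 0.2, 0.1, 0.2, 0.3, 0.4, 0.5],
    }
)
bad = ["f1"]  # assumed result of the detection step

kbin = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
kbin.fit(df[bad])


def apply_kbin(data: pd.DataFrame) -> pd.DataFrame:
    if not bad:  # nothing flagged: return the input unchanged
        return data
    binned = pd.DataFrame(
        kbin.transform(data[bad]),
        columns=[f"Kbin{i + 1}" for i in range(len(bad))],
        index=data.index,
    )
    # Drop the raw columns and append their binned replacements.
    return pd.concat([data.drop(columns=bad), binned], axis=1)


out = apply_kbin(df)
```

With ordinal encoding the extreme value 900.0 simply lands in the highest bin, so its magnitude no longer dominates the feature.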

Parameters:

data (pandas.DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame with discretized columns.

Return type:

pandas.DataFrame

Raises:

Exception – If transformation fails unexpectedly.

class proqsar.Preprocessor.Outlier.multivariate_outliers.MultivariateOutliersHandler(activity_col: str | None = None, id_col: str | None = None, select_method: str = 'LocalOutlierFactor', n_jobs: int = 1, random_state: int | None = 42, save_method: bool = False, save_dir: str | None = 'Project/MultivOutlierHandler', save_trans_data: bool = False, trans_data_name: str = 'trans_data', deactivate: bool = False)

Bases: BaseEstimator, TransformerMixin

Detect and remove multivariate outliers from tabular datasets.

The handler supports several algorithms for multivariate outlier detection:

  • "LocalOutlierFactor"

  • "IsolationForest"

  • "OneClassSVM"

  • "RobustCovariance" (EllipticEnvelope with contamination=0.1)

  • "EmpiricalCovariance" (EllipticEnvelope with support_fraction=1)

Outliers are identified during fit() and removed during transform().
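One way to picture the method keys is as a name-to-estimator map over scikit-learn detectors. The map below is an assumed reconstruction from the list above (only the EllipticEnvelope settings are stated in this reference); the toy data is a grid with one distant point.

```python
import pandas as pd
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

# Assumed name -> estimator map, using the settings stated in the list above.
detectors = {
    "LocalOutlierFactor": LocalOutlierFactor(),
    "IsolationForest": IsolationForest(random_state=42),
    "OneClassSVM": OneClassSVM(),
    "RobustCovariance": EllipticEnvelope(contamination=0.1, random_state=42),
    "EmpiricalCovariance": EllipticEnvelope(support_fraction=1.0, random_state=42),
}

# A 5 x 5 grid of inliers plus one far-away point in the last row.
grid = [(float(i), float(j)) for i in range(5) for j in range(5)]
df = pd.DataFrame(grid + [(100.0, 100.0)], columns=["f1", "f2"])

labels = detectors["LocalOutlierFactor"].fit_predict(df)  # -1 marks outliers
kept = df[labels == 1]
```

All five estimators share the convention that 1 marks an inlier and -1 an outlier, which is what makes a single filtering step possible regardless of the chosen key.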

Parameters:
  • activity_col (Optional[str]) – Name of the activity/target column to ignore when fitting.

  • id_col (Optional[str]) – Name of the identifier column to ignore when fitting.

  • select_method (str) – Algorithm to use for detection. One of {"LocalOutlierFactor", "IsolationForest", "OneClassSVM", "RobustCovariance", "EmpiricalCovariance"}. Default is "LocalOutlierFactor".

  • n_jobs (int) – Number of parallel jobs (where supported). Default is 1.

  • random_state (Optional[int]) – Random seed for reproducibility where applicable. Default is 42.

  • save_method (bool) – If True, save the fitted handler as a pickle.

  • save_dir (Optional[str]) – Directory to store pickled handler / transformed data. Default is “Project/MultivOutlierHandler”.

  • save_trans_data (bool) – If True, save transformed DataFrame to CSV.

  • trans_data_name (str) – Base filename for saving transformed CSV.

  • deactivate (bool) – If True, disables the handler. Methods become no-ops.

Variables:
  • multi_outlier_handler (object | None) – The fitted estimator instance, or None.

  • features (pandas.Index | None) – List of feature column names used in fitting.

  • data_fit (pandas.DataFrame) – The feature matrix used at fit time.

  • transformed_data (pandas.DataFrame | None) – Last transformed DataFrame.

static compare_multivariate_methods(data1: DataFrame, data2: DataFrame | None = None, data1_name: str = 'data1', data2_name: str = 'data2', activity_col: str | None = None, id_col: str | None = None, methods_to_compare: List[str] | None = None, save_dir: str | None = 'Project/OutlierHandler') DataFrame

Compare multiple outlier detection methods across datasets.

Parameters:
  • data1 (pandas.DataFrame) – Primary dataset.

  • data2 (Optional[pandas.DataFrame]) – Optional second dataset for evaluation.

  • data1_name (str) – Label for dataset1 in results.

  • data2_name (str) – Label for dataset2 in results.

  • activity_col (Optional[str]) – Activity/target column name to exclude.

  • id_col (Optional[str]) – Identifier column name to exclude.

  • methods_to_compare (Optional[List[str]]) – List of algorithms to compare. If None, defaults to all.

  • save_dir (Optional[str]) – If set, saves comparison results CSV to this directory.

Returns:

Summary table with rows removed for each method and dataset.

Return type:

pandas.DataFrame

Raises:

Exception – If comparison fails unexpectedly.

fit(data: DataFrame, y=None) MultivariateOutliersHandler

Fit the selected outlier detector on the given dataset.

Parameters:
  • data (pandas.DataFrame) – Input DataFrame containing features and optional id/activity columns.

  • y (Any) – Ignored (sklearn API compatibility).

Returns:

Fitted handler (self).

Return type:

MultivariateOutliersHandler

Raises:
  • ValueError – If select_method is not supported.

  • Exception – If fitting fails unexpectedly.

fit_transform(data: DataFrame, y=None) DataFrame

Fit the outlier detector and immediately transform the data.

Parameters:
  • data (pandas.DataFrame) – Input dataset to fit and filter.

  • y (Any) – Ignored (sklearn API compatibility).

Returns:

Transformed DataFrame with outliers removed.

Return type:

pandas.DataFrame

set_fit_request(*, data: bool | None | str = '$UNCHANGED$') MultivariateOutliersHandler

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters

data : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for data parameter in fit.

Returns

self : object

The updated object.

set_transform_request(*, data: bool | None | str = '$UNCHANGED$') MultivariateOutliersHandler

Request metadata passed to the transform method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters

data : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for data parameter in transform.

Returns

self : object

The updated object.

transform(data: DataFrame) DataFrame

Remove rows flagged as outliers.

  • For LocalOutlierFactor, supports both in-sample and novelty detection.

  • For other estimators, uses predict() with outliers = -1.
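The distinction between in-sample and novelty use of LocalOutlierFactor comes from scikit-learn itself and can be sketched as below. How the handler chooses between the two modes is not documented here, so this only shows the two underlying calls on toy data.

```python
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

grid = [(float(i), float(j)) for i in range(5) for j in range(5)]
train = pd.DataFrame(grid, columns=["f1", "f2"])
new = pd.DataFrame([(2.0, 2.0), (100.0, 100.0)], columns=["f1", "f2"])

# In-sample use: the default LocalOutlierFactor only offers fit_predict,
# which labels the same rows it was fitted on.
in_sample_labels = LocalOutlierFactor().fit_predict(train)

# Novelty use: novelty=True enables predict() on rows not seen during fit.
novelty_lof = LocalOutlierFactor(novelty=True).fit(train)
new_labels = novelty_lof.predict(new)

filtered_new = new[new_labels == 1]  # keep inliers, drop rows labelled -1
```

The final filtering line is the same pattern that applies to the other estimators, which expose predict() without a special mode.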

Parameters:

data (pandas.DataFrame) – DataFrame with the same feature columns as used in fit().

Returns:

DataFrame with outlier rows removed.

Return type:

pandas.DataFrame

Raises:
  • NotFittedError – If called before fit().

  • Exception – If transformation fails unexpectedly.

Model

Feature Selector

class proqsar.Model.FeatureSelector.feature_selector.FeatureSelector(activity_col: str = 'activity', id_col: str = 'id', select_method: str | List[str] | None = None, add_method: dict | None = None, cross_validate: bool = True, save_method: bool = False, save_trans_data: bool = False, trans_data_name: str = 'trans_data', save_dir: str | None = 'Project/FeatureSelector', n_jobs: int = 1, random_state: int | None = 42, deactivate: bool = False, **kwargs)

Bases: CrossValidationConfig, BaseEstimator

Pipeline component for feature selection.

This class wraps multiple feature-selection strategies and provides an estimator-like interface, making it compatible with scikit-learn pipelines.

Key behaviors:
  • If select_method is a list (or None) and cross_validate=True, evaluates candidate selectors with repeated CV and selects the best one based on scoring_target.

  • If select_method is a string, directly fits the corresponding selector.

  • Provides fit, transform, fit_transform and set_params methods.

  • Supports saving fitted models and transformed datasets.
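Selecting among candidate selectors with repeated cross-validation can be sketched in scikit-learn as follows. The candidate names are borrowed from the usage example in this section, but the selectors behind them, the downstream classifier, the f1 scoring target, and k=3 are all assumptions rather than the library's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(
    n_samples=120, n_features=10, n_informative=3, random_state=42
)

# Assumed candidates: names follow this section, implementations are guesses.
candidates = {
    "Anova": SelectKBest(f_classif, k=3),
    "MutualInformation": SelectKBest(mutual_info_classif, k=3),
}

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=42)
scores = {}
for name, selector in candidates.items():
    # Selection runs inside the pipeline, so each fold selects features
    # on its own training split only.
    pipe = make_pipeline(selector, LogisticRegression(max_iter=1000))
    scores[name] = cross_val_score(pipe, X, y, cv=cv, scoring="f1").mean()

best_name = max(scores, key=scores.get)
best_selector = candidates[best_name].fit(X, y)  # refit the winner on all data
X_selected = best_selector.transform(X)
```

Keeping the selector inside the cross-validated pipeline avoids leaking information from validation folds into the feature ranking.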

Parameters:
  • activity_col (str) – Column name for the target variable. Default is "activity".

  • id_col (str) – Column name for record identifiers. Default is "id".

  • select_method (Optional[Union[str, List[str]]]) – Method name or list of method names. If None, all methods are compared.

  • add_method (Optional[dict]) – Extra methods to add to the method map (name → selector instance).

  • cross_validate (bool) – If True, compare candidate methods with CV. Default is True.

  • save_method (bool) – If True, save the fitted FeatureSelector object as pickle. Default is False.

  • save_trans_data (bool) – If True, save transformed datasets to CSV. Default is False.

  • trans_data_name (str) – Base filename for transformed datasets. Default is "trans_data".

  • save_dir (Optional[str]) – Directory for saving models and transformed data. Default is "Project/FeatureSelector".

  • n_jobs (int) – Number of parallel jobs for supported estimators. Default is 1.

  • random_state (Optional[int]) – Random seed for reproducibility. Default is 42.

  • deactivate (bool) – If True, disables feature selection (fit/transform skipped). Default is False.

  • kwargs – Additional arguments forwarded to CrossValidationConfig.

Example

import pandas as pd
from proqsar.Model.FeatureSelector.feature_selector import FeatureSelector

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "feature1": [0.1, 0.2, 0.3, 0.4, 0.5],
    "feature2": [5, 4, 3, 2, 1],
    "activity": [0, 1, 0, 1, 0]
})

selector = FeatureSelector(
    activity_col="activity",
    id_col="id",
    select_method=["Anova", "MutualInformation"],
    cross_validate=True
)

selector.fit(df)
df_transformed = selector.transform(df)

print(df_transformed.head())

fit(data: DataFrame) FeatureSelector

Fit feature selector(s) on the dataset.

Parameters:

data (pd.DataFrame) – Input DataFrame containing features, id column, and activity column.

Returns:

Self, with fitted selector and optional CV report.

Return type:

FeatureSelector

Raises:
  • ValueError – If select_method is invalid or not recognized.

  • AttributeError – If a list of methods is provided without cross_validate=True.

  • Exception – For unexpected runtime errors.

fit_transform(data: DataFrame) DataFrame

Fit to the dataset, then transform it.

Parameters:

data (pd.DataFrame) – Input DataFrame.

Returns:

Transformed dataset.

Return type:

pd.DataFrame

set_fit_request(*, data: bool | None | str = '$UNCHANGED$') FeatureSelector

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters

data : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for data parameter in fit.

Returns

self : object

The updated object.

set_params(**kwargs) FeatureSelector

Update attributes with provided keyword arguments.

Parameters:

kwargs (dict) – Mapping of attribute names to values.

Returns:

Updated FeatureSelector object.

Return type:

FeatureSelector

Raises:

KeyError – If an invalid attribute name is provided.

set_transform_request(*, data: bool | None | str = '$UNCHANGED$') FeatureSelector

Request metadata passed to the transform method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters

data : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for data parameter in transform.

Returns

self : object

The updated object.

transform(data: DataFrame) DataFrame

Transform dataset using the fitted selector.

Parameters:

data (pd.DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame with selected features and preserved id/activity columns.

Return type:

pd.DataFrame

Raises:
  • NotFittedError – If fit has not been called before.

  • Exception – For unexpected runtime errors.

Model Developer

class proqsar.Model.ModelDeveloper.model_developer.ModelDeveloper(activity_col: str = 'activity', id_col: str = 'id', select_model: str | List[str] | None = None, add_model: dict = {}, cross_validate: bool = True, save_model: bool = False, save_pred_result: bool = False, pred_result_name: str = 'pred_result', save_dir: str | None = 'Project/ModelDeveloper', n_jobs: int = 1, random_state: int | None = 42, **kwargs)

Bases: CrossValidationConfig, BaseEstimator

Wrapper for model selection, cross-validated evaluation, model fitting and prediction.

This class:
  • infers the task type (classification/regression) from the data,

  • constructs a default model map (mergeable with add_model),

  • optionally cross-validates candidate models and selects the best one,

  • fits the selected model on the full provided dataset,

  • exposes predict to create a predictions DataFrame,

  • optionally saves the fitted ModelDeveloper instance or prediction results.

Parameters:
  • activity_col (str) – Column name for the target variable.

  • id_col (str) – Column name for the identifier column.

  • select_model (Optional[Union[str, List[str]]]) – Name of the model to use or a list of candidate names to evaluate. If None and cross_validate=True, all models in the map are compared.

  • add_model (dict) – Additional models to include in the model map (name -> estimator or (estimator, …)).

  • cross_validate (bool) – Whether to run cross-validation to select among candidate models.

  • save_model (bool) – If True, save the fitted ModelDeveloper object (pickle) to save_dir.

  • save_pred_result (bool) – If True, save prediction results to CSV when predict is called.

  • pred_result_name (str) – Filename (without directory) for saved prediction results.

  • save_dir (Optional[str]) – Directory for saving model/prediction files.

  • n_jobs (int) – Number of parallel jobs passed to underlying estimators.

  • random_state (Optional[int]) – Random seed for reproducible estimators.

  • kwargs (dict) – Forwarded to CrossValidationConfig for CV-related parameters (e.g., n_splits, scoring_target, scoring_list).

fit(data: DataFrame) ModelDeveloper

Fit (or select and fit) the model on the provided dataset.

Behavior:
  • Infers task type and CV strategy,

  • Builds the model map merged with add_model,

  • If select_model is None or a list and cross_validate is True, runs cross-validation to select the best model and fits it on full data.

  • If select_model is a string, fits that model directly and optionally runs CV.

  • Saves the fitted ModelDeveloper instance if save_model is True.
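The branching in the list above can be sketched in plain Python. This is a minimal illustration, not ProQSAR's implementation: choose_model, score_fn, and the toy model names are hypothetical, scoring is reduced to a dictionary lookup, and both the merge precedence and the error raised for a list without cross-validation are assumptions.

```python
from typing import Callable, Dict, List, Optional, Union


def choose_model(
    model_map: Dict[str, object],
    add_model: Dict[str, object],
    select_model: Optional[Union[str, List[str]]],
    cross_validate: bool,
    score_fn: Callable[[str], float],
) -> str:
    """Return the name of the model that would be fitted on the full data."""
    # Merge user-supplied models into the default map
    # (assumption: user entries override defaults of the same name).
    merged = {**model_map, **add_model}

    # A single name is used directly; no selection step is needed.
    if isinstance(select_model, str):
        if select_model not in merged:
            raise KeyError(f"unknown model: {select_model}")
        return select_model

    # None means "compare every model in the map"; a list restricts the candidates.
    candidates = list(merged) if select_model is None else list(select_model)
    if not cross_validate:
        # Assumption: the source does not say what happens in this case.
        raise ValueError("cross_validate=True is needed to pick among several models")

    # Pick the candidate with the highest cross-validated score.
    return max(candidates, key=score_fn)


# Toy scores standing in for cross-validation results.
scores = {"ridge": 0.61, "forest": 0.74, "my_svm": 0.70}
default_map = {"ridge": object(), "forest": object()}
extra = {"my_svm": object()}

best_all = choose_model(default_map, extra, None, True, scores.__getitem__)
best_subset = choose_model(default_map, extra, ["ridge", "my_svm"], True, scores.__getitem__)
direct = choose_model(default_map, extra, "ridge", False, scores.__getitem__)
```

Here best_all compares all three models, best_subset compares only the two listed, and direct skips selection because a single name was given.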

Parameters:

data (pd.DataFrame) – DataFrame containing features and the activity/id columns.

Returns:

The fitted ModelDeveloper instance.

Return type:

ModelDeveloper

Raises:

Exception – Any unexpected exception is logged and re-raised.

predict(data: DataFrame) DataFrame

Generate predictions for the provided data using the fitted model.

The method returns a DataFrame that always contains the id column and a ‘Predicted value’ column, and includes the true activity values if available. For classification tasks, probability columns for each class are also included.

Parameters:

data (pd.DataFrame) – DataFrame containing features and id/activity columns.

Returns:

DataFrame with prediction results. The results are also written to CSV if save_pred_result is True.

Return type:

pd.DataFrame

Raises:
  • NotFittedError – If fit has not been called and the internal model is not present.

  • Exception – Any unexpected exception is logged and re-raised.
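The layout of the returned predictions can be illustrated without pandas by building one dictionary per row. This is a sketch under stated assumptions: build_pred_rows is a hypothetical helper, and the "Probability_<class>" column naming is invented for illustration, since the source only says that per-class probability columns exist.

```python
from typing import Dict, List, Optional, Sequence


def build_pred_rows(
    ids: Sequence[str],
    predicted: Sequence[float],
    activity: Optional[Sequence[float]] = None,
    proba: Optional[Dict[str, Sequence[float]]] = None,
    id_col: str = "id",
    activity_col: str = "activity",
) -> List[dict]:
    """Assemble prediction rows: id and 'Predicted value' always, the rest if available."""
    rows = []
    for i, (identifier, value) in enumerate(zip(ids, predicted)):
        row = {id_col: identifier, "Predicted value": value}
        # True activity values are included only when the input carried them.
        if activity is not None:
            row[activity_col] = activity[i]
        # Classification only: one probability column per class
        # (column naming here is an assumption).
        if proba is not None:
            for label, column in proba.items():
                row[f"Probability_{label}"] = column[i]
        rows.append(row)
    return rows


rows = build_pred_rows(
    ids=["mol_a", "mol_b"],
    predicted=[1, 0],
    proba={"0": [0.2, 0.9], "1": [0.8, 0.1]},
)
```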

set_fit_request(*, data: bool | None | str = '$UNCHANGED$') ModelDeveloper

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters

data : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for data parameter in fit.

Returns

self : object

The updated object.
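The four request options listed for this method can be summarized as a small decision function. This is a simplified model of the semantics only, not scikit-learn's routing code: route_metadata is a hypothetical name, and ValueError stands in for the dedicated exception type scikit-learn raises.

```python
from typing import Any, Dict, Union

Request = Union[bool, None, str]


def route_metadata(param: str, request: Request, provided: Dict[str, Any]) -> Dict[str, Any]:
    """Return the keyword arguments a meta-estimator would forward for one parameter."""
    if request is True:
        # Requested: forward when provided, silently skip otherwise.
        return {param: provided[param]} if param in provided else {}
    if request is False:
        # Not requested: never forwarded.
        return {}
    if request is None:
        # Not requested: providing it anyway is an error.
        if param in provided:
            raise ValueError(f"{param!r} was passed but is not explicitly requested")
        return {}
    # String alias: the caller supplies the value under the alias,
    # and it is forwarded under the original parameter name.
    return {param: provided[request]} if request in provided else {}


forwarded = route_metadata("data", True, {"data": [1, 2, 3]})
aliased = route_metadata("data", "frame", {"frame": [4, 5]})
ignored = route_metadata("data", False, {"data": [1, 2, 3]})
```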

set_params(**kwargs)

Update attributes of the ModelDeveloper instance.

Only existing attributes may be updated; unknown keys raise KeyError. Returns self to allow fluent chaining.

Parameters:

kwargs (dict) – Attribute names and their new values.

Returns:

The same instance with updated attributes.

Return type:

ModelDeveloper

Raises:

KeyError – If a provided key does not correspond to an existing attribute.
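The contract described for set_params (update existing attributes only, raise KeyError otherwise, return self) can be sketched with a stand-in class. Configurable is hypothetical and only mirrors the documented behavior; it is not the ModelDeveloper source.

```python
class Configurable:
    """Stand-in object with two attributes that may be updated."""

    def __init__(self, activity_col: str = "activity", n_jobs: int = 1):
        self.activity_col = activity_col
        self.n_jobs = n_jobs

    def set_params(self, **kwargs):
        for key, value in kwargs.items():
            # Reject names that are not already attributes of the instance.
            if not hasattr(self, key):
                raise KeyError(f"'{key}' is not an existing attribute")
            setattr(self, key, value)
        # Returning self is what makes chained calls possible.
        return self


obj = Configurable().set_params(n_jobs=4).set_params(activity_col="pIC50")
```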

set_predict_request(*, data: bool | None | str = '$UNCHANGED$') ModelDeveloper

Request metadata passed to the predict method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters

data : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for data parameter in predict.

Returns

self : object

The updated object.

Optimizer

class proqsar.Model.Optimizer.optimizer.Optimizer(activity_col: str = 'activity', id_col: str = 'id', select_model: List[str] | None = None, scoring: str | None = None, param_ranges: Dict[str, Dict[str, Any]] = {}, add_model: Dict[str, Tuple[Any, Dict[str, Any]]] = {}, n_trials: int = 50, n_splits: int = 5, n_repeats: int = 2, n_jobs: int = 1, random_state: int = 42, study_name: str = 'my_study', deactivate: bool = False)

Bases: BaseEstimator

Optimize hyperparameters for one or more candidate models using Optuna.

The Optimizer supports:
  • specifying which models to search over (select_model),

  • custom parameter ranges for each model (param_ranges),

  • adding custom models (add_model: mapping model_name -> (estimator, param_ranges)),

  • repeated cross-validation for robust scoring,

  • retrieving best parameters and score after optimization.

Parameters:
  • activity_col (str) – Column name for the target variable (default: “activity”).

  • id_col (str) – Column name for the identifier column (default: “id”).

  • select_model (list[str] | None) – Optional list of model names to evaluate. If None, the default model list for the detected task will be used.

  • scoring (str | None) – Scoring metric name used by sklearn (e.g., ‘f1’, ‘r2’). If None, defaults to ‘f1’ for classification and ‘r2’ for regression.

  • param_ranges (dict) – Mapping model_name -> parameter ranges used by the trial sampler. Example: {"RandomForestClassifier": {"n_estimators": (50, 200)}}.

  • add_model (dict) – Mapping of custom models to add. Expected format: {name: (estimator_instance, param_range_dict)}.

  • n_trials (int) – Number of Optuna trials to run (default: 50).

  • n_splits (int) – Number of CV folds (default: 5).

  • n_repeats (int) – Number of CV repeats (default: 2).

  • n_jobs (int) – Number of parallel jobs passed to cross_val_score and some estimators.

  • random_state (int) – Random seed used for reproducibility (default: 42).

  • study_name (str) – Optuna study name / storage key base (default: ‘my_study’).

  • deactivate (bool) – If True, optimization is skipped and the instance is returned as-is.
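The shapes of param_ranges and add_model, and the scoring fallback, can be made concrete with a short sketch. The helper names default_scoring and collect_ranges are hypothetical, and the precedence used when both arguments mention the same model is an assumption; the source does not specify it.

```python
from typing import Any, Dict, Optional, Tuple

ParamRanges = Dict[str, Dict[str, Any]]


def default_scoring(task: str, scoring: Optional[str] = None) -> str:
    """Fall back to 'f1' (classification) or 'r2' (regression) when scoring is None."""
    if scoring is not None:
        return scoring
    return "f1" if task == "classification" else "r2"


def collect_ranges(
    param_ranges: ParamRanges,
    add_model: Dict[str, Tuple[Any, Dict[str, Any]]],
) -> ParamRanges:
    """Gather one range dictionary per model name."""
    # Start from the ranges bundled with each custom model ...
    combined = {name: dict(ranges) for name, (_estimator, ranges) in add_model.items()}
    # ... then let explicit param_ranges entries extend or override them
    # (assumed precedence).
    for name, ranges in param_ranges.items():
        combined.setdefault(name, {}).update(ranges)
    return combined


ranges = collect_ranges(
    {"RandomForestClassifier": {"n_estimators": (50, 200)}},
    {"MyModel": (object(), {"depth": (2, 8)})},  # placeholder estimator
)
```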

get_best_params() Dict[str, Any]

Return the best hyperparameter dictionary found by the last optimize() call.

Returns:

Best parameters dictionary.

Return type:

Dict[str, Any]

Raises:

AttributeError – If optimize() has not been run and best_params is not set.

get_best_score() float

Return the best cross-validated score found by the last optimize() call.

Returns:

Best cross-validated score.

Return type:

float

Raises:

AttributeError – If optimize() has not been run and best_score is not set.

optimize(data: DataFrame) Tuple[Dict[str, Any], float] | Optimizer

Run the Optuna optimization process to find the best hyperparameters.

Steps:
  • Infer task type and CV splitting strategy.

  • Build the list of candidate models (either user-provided or the default from _get_model_list).

  • Define an Optuna objective that samples model name (if multiple) and hyperparameters, sets them on the model, and evaluates via cross_val_score using the configured CV splitter.

  • Create or load an Optuna study (SQLite storage ‘example.db’) and run the specified number of trials.

  • Store best_params and best_score on the instance and return them.
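The shape of the search loop in the steps above can be sketched with the standard library alone. Note the substitutions: plain random search replaces the Optuna sampler, a toy scoring function replaces cross_val_score, and there is no study storage. The names random_search and toy_evaluate, and the "model" key in the result, are hypothetical.

```python
import random
from statistics import mean
from typing import Callable, Dict, Tuple

Ranges = Dict[str, Dict[str, Tuple[int, int]]]


def random_search(
    param_ranges: Ranges,
    evaluate: Callable[[str, Dict[str, int], int], float],
    n_trials: int = 50,
    n_splits: int = 5,
    n_repeats: int = 2,
    random_state: int = 42,
) -> Tuple[Dict[str, object], float]:
    """Sample (model, params) pairs and keep the best mean score."""
    rng = random.Random(random_state)
    model_names = sorted(param_ranges)
    best_params: Dict[str, object] = {}
    best_score = float("-inf")
    for _ in range(n_trials):
        # Sample the model name, then integer hyperparameters within its ranges.
        model = rng.choice(model_names)
        params = {
            name: rng.randint(low, high)
            for name, (low, high) in param_ranges[model].items()
        }
        # One score per fold of every repeat, averaged into the trial score.
        score = mean(evaluate(model, params, fold) for fold in range(n_splits * n_repeats))
        if score > best_score:
            best_params, best_score = {"model": model, **params}, score
    return best_params, best_score


def toy_evaluate(model: str, params: Dict[str, int], fold: int) -> float:
    """Toy objective: 'forest' peaks at n_estimators=120, anything else scores -5."""
    if model != "forest":
        return -5.0
    return -abs(params["n_estimators"] - 120) / 100


search_space = {"forest": {"n_estimators": (50, 200)}, "knn": {"n_neighbors": (1, 15)}}
best_params, best_score = random_search(search_space, toy_evaluate)
```

Fixing random_state makes repeated runs return the same pair, which is the role the seed plays in the documented parameters.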

Parameters:

data (pd.DataFrame) – DataFrame containing feature columns and the activity/id columns.

Returns:

(best_params, best_score) tuple on success or self if deactivated.

Return type:

Tuple[Dict[str, Any], float] | Optimizer

Raises:

Exception – Any unexpected exceptions are logged and re-raised.

Automation

Pipeline

Inference

Inference-focused runner that prepares inputs, calls a prediction pipeline, and writes results back in-place by default.

The runner stores light metadata after each run:
  • last_input_df: full input DataFrame after prediction (deep-copied when possible)

  • last_preds: DataFrame or Series-like predictions captured from the pipeline

  • last_run_time, last_n, last_prediction_summary

The pretty __repr__ produces a concise box showing inference statistics:
  • prediction mean/std/quantiles

  • Applicability Domain (AD) counts if present in the input frame

  • largest Prediction Interval (PI) range if PI lower/upper columns exist

  • top / bottom K predicted items (shows SMILES if available)
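The prediction statistics and top/bottom K items listed above are simple to compute. The sketch below covers only those two entries (not the AD counts or PI ranges) and uses the standard library; summarize is a hypothetical helper keyed by SMILES, not the runner's actual __repr__ logic.

```python
from statistics import mean, quantiles, stdev
from typing import Dict


def summarize(preds: Dict[str, float], k: int = 2) -> dict:
    """Summarize predictions keyed by SMILES (needs at least two values)."""
    values = list(preds.values())
    q1, median, q3 = quantiles(values, n=4)  # quartile cut points
    ranked = sorted(preds.items(), key=lambda item: item[1], reverse=True)
    return {
        "n": len(values),
        "mean": mean(values),
        "std": stdev(values),
        "quartiles": (q1, median, q3),
        "top": ranked[:k],             # highest predictions first
        "bottom": ranked[-k:][::-1],   # lowest predictions first
    }


summary = summarize({"CCO": 5.0, "c1ccccc1": 7.0, "CC(=O)O": 6.0, "CN": 4.0})
```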

param pipeline:

Object exposing required attributes and method: id_col, smiles_col, activity_col, and a callable predict(df, alpha=…) which returns a DataFrame, Series/array-like, or mapping of prediction values.

type pipeline:

object
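The interface required of pipeline can be written down as a structural type. This is an illustrative sketch: PredictionPipeline and ConstantPipeline are hypothetical names, and the runner itself may not use typing.Protocol at all. Any object with the three column attributes and a predict(df, alpha=...) method would satisfy it.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class PredictionPipeline(Protocol):
    """Attributes and method the runner expects from a pipeline object."""

    id_col: str
    smiles_col: str
    activity_col: str

    def predict(self, df: Any, alpha: float = 0.05) -> Any: ...


class ConstantPipeline:
    """Toy pipeline that predicts 0.0 for every row and returns a mapping."""

    id_col = "id"
    smiles_col = "SMILES"
    activity_col = "activity"

    def predict(self, df, alpha=0.05):
        # Rows are plain dictionaries here to keep the sketch dependency-free.
        return {row[self.id_col]: 0.0 for row in df}


pipeline = ConstantPipeline()
result = pipeline.predict([{"id": "mol_a", "SMILES": "CCO"}], alpha=0.1)
```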

param inplace:

If True and the provided input is a pandas DataFrame, it is mutated in-place. If False, a copy is used and returned. Default: True.

type inplace:

bool
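The difference between the two inplace settings can be shown with plain Python rows instead of a DataFrame. This is a sketch of the semantics only: write_predictions and the "Predicted value" column name are hypothetical here, and a list of dictionaries stands in for the input frame.

```python
import copy
from typing import List, Sequence


def write_predictions(
    rows: List[dict],
    preds: Sequence[float],
    inplace: bool = True,
    column: str = "Predicted value",
) -> List[dict]:
    """Attach predictions to the rows, mutating the input only when inplace is True."""
    target = rows if inplace else copy.deepcopy(rows)
    for row, value in zip(target, preds):
        row[column] = value
    return target


original = [{"id": "mol_a"}, {"id": "mol_b"}]

# inplace=False: the caller's rows are left untouched.
copied = write_predictions(original, [5.1, 6.3], inplace=False)
untouched = "Predicted value" not in original[0]

# inplace=True (the default): the same object is modified and returned.
same = write_predictions(original, [5.1, 6.3])
```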

param alpha:

Default alpha forwarded to pipeline.predict. Default: 0.05.

type alpha:

float

param logger:

Optional logger to use for exceptions and debug messages. If None, the module logger is used.

type logger:

logging.Logger | None