API
Data
Standardizer
- class proqsar.Data.Standardizer.SMILESStandardizer(smiles_col: str = 'SMILES', normalize: bool = True, tautomerize: bool = True, remove_salts: bool = False, handle_charges: bool = False, uncharge: bool = False, handle_stereo: bool = True, remove_fragments: bool = False, largest_fragment_only: bool = False, n_jobs: int = 1, deactivate: bool = False)
Bases: BaseEstimator
Class for comprehensive standardization of chemical structures represented in SMILES format.
This class provides a configurable pipeline of standardization steps using RDKit. It ensures molecules are normalized, tautomers canonicalized, salts removed, charges adjusted, stereochemistry assigned, fragments filtered, and hydrogen handling is consistent. The standardized molecules are output as both RDKit Mol objects and SMILES strings, suitable for downstream cheminformatics workflows.
Methods
- smiles2mol(smiles: str) -> Optional[Chem.Mol]
Convert a SMILES string to an RDKit Mol object.
- standardize_mol(mol: Chem.Mol) -> Optional[Chem.Mol]
Apply all configured standardization steps to an RDKit Mol object.
- standardize_smiles(smiles: str) -> Tuple[Optional[str], Optional[Chem.Mol]]
Convert and standardize a SMILES string, returning both the canonical SMILES and RDKit Mol object.
- standardize_dict_smiles(data_input: Union[pd.DataFrame, List[dict]])
Apply standardization to all SMILES in a DataFrame or list of dicts.
- param smiles_col:
Column/key name containing SMILES strings in the input data.
- type smiles_col:
str, optional
- param normalize:
If True, normalize molecules (aromaticity, functional groups, etc.).
- type normalize:
bool, optional
- param tautomerize:
If True, canonicalize tautomers into a single representation.
- type tautomerize:
bool, optional
- param remove_salts:
If True, strip counter-ions and salt fragments.
- type remove_salts:
bool, optional
- param handle_charges:
If True, reionize charges to standard protonation states.
- type handle_charges:
bool, optional
- param uncharge:
If True, remove charges by neutralizing charged species.
- type uncharge:
bool, optional
- param handle_stereo:
If True, assign or clean stereochemistry information.
- type handle_stereo:
bool, optional
- param remove_fragments:
If True, discard extra fragments (e.g., keep only parent molecule).
- type remove_fragments:
bool, optional
- param largest_fragment_only:
If True, retain only the largest connected fragment.
- type largest_fragment_only:
bool, optional
- param n_jobs:
Number of parallel jobs to use when standardizing a batch of molecules.
- type n_jobs:
int, optional
- param deactivate:
If True, disable all standardization steps (useful for debugging).
- type deactivate:
bool, optional
- static smiles2mol(smiles: str) Mol | None
Convert a SMILES string to an RDKit Mol object.
- Parameters:
smiles (str) – SMILES string to be converted.
- Returns:
RDKit Mol object or None if conversion fails.
- Return type:
Optional[Chem.Mol]
- standardize_dict_smiles(data_input: DataFrame | List[dict]) DataFrame | List[dict]
Standardize SMILES strings within a pandas DataFrame or a list of dictionaries using parallel processing.
- Parameters:
data_input (Union[pandas.DataFrame, List[dict]]) – Data containing SMILES strings to be standardized. Can be a pandas DataFrame or list of dicts.
- Returns:
Input data with additional standardized SMILES and Mol columns/keys.
- Return type:
Union[pandas.DataFrame, List[dict]]
- Raises:
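The batch behaviour of standardize_dict_smiles can be pictured as mapping a per-record function over the input in parallel. The following is a minimal sketch using only the Python standard library, not proqsar's implementation: `standardize_one` is a hypothetical placeholder for the real per-molecule standardization, the output key `standardized_SMILES` is an assumed name, and `concurrent.futures` stands in for whatever parallel backend the library actually uses.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Optional


def standardize_one(smiles: str) -> Optional[str]:
    """Hypothetical placeholder: trims whitespace, returns None for empty input."""
    cleaned = smiles.strip()
    return cleaned or None


def standardize_records(records: List[Dict], smiles_col: str = "SMILES",
                        n_jobs: int = 1) -> List[Dict]:
    """Return copies of the records with an extra (assumed) result key."""
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # map() preserves input order, so results line up with the records.
        results = list(pool.map(standardize_one, (r[smiles_col] for r in records)))
    return [{**r, "standardized_SMILES": s} for r, s in zip(records, results)]


records = [{"id": 1, "SMILES": " CCO "}, {"id": 2, "SMILES": ""}]
standardized = standardize_records(records, n_jobs=2)
print(standardized)
```

The input records are copied rather than mutated, so a failed conversion (None) can be inspected next to the original SMILES.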
- standardize_mol(mol: Mol) Mol | None
Standardize an RDKit Mol object using various chemical standardization steps.
- Parameters:
mol (Chem.Mol) – The molecule to be standardized.
- Returns:
The standardized molecule or None if it cannot be processed.
- Return type:
Optional[Chem.Mol]
- Raises:
ValueError – If the input molecule is None.
- proqsar.Data.Standardizer.assign_stereochemistry(mol: Mol, cleanIt: bool = True, force: bool = True) None
Assign stereochemistry to a molecule using RDKit’s AssignStereochemistry.
- proqsar.Data.Standardizer.canonicalize_tautomer(mol: Mol) Mol
Canonicalize the tautomer of a molecule using RDKit’s TautomerCanonicalizer.
- Parameters:
mol (Chem.Mol) – The RDKit molecule object.
- Returns:
The molecule object with canonicalized tautomer.
- Return type:
Chem.Mol
>>> mol = Chem.MolFromSmiles("O=C1NC=CC1=O")
>>> canonicalized = canonicalize_tautomer(mol)
- proqsar.Data.Standardizer.fragments_remover(mol: Mol) Mol | None
Remove small fragments from a molecule, keeping only the largest one.
- Parameters:
mol (Chem.Mol) – The RDKit molecule object.
- Returns:
The molecule object with only the largest fragment kept, or None if fragment removal fails.
- Return type:
Optional[Chem.Mol]
>>> mol = Chem.MolFromSmiles("CCC.CCCO")
>>> largest = fragments_remover(mol)
- proqsar.Data.Standardizer.normalize_molecule(mol: Mol) Mol
Normalize a molecule using RDKit’s Normalizer to correct functional groups and recombine charges.
- Parameters:
mol (Chem.Mol) – The RDKit molecule object to be normalized.
- Returns:
The normalized RDKit molecule object.
- Return type:
Chem.Mol
>>> mol = Chem.MolFromSmiles("CC(=O)O")
>>> normalized = normalize_molecule(mol)
- proqsar.Data.Standardizer.reionize_charges(mol: Mol) Mol
Adjust a molecule to its most likely ionic state using RDKit’s Reionizer.
- Parameters:
mol (Chem.Mol) – The RDKit molecule object.
- Returns:
The molecule object with reionized charges.
- Return type:
Chem.Mol
>>> mol = Chem.MolFromSmiles("CC[NH3+]")
>>> reionized = reionize_charges(mol)
- proqsar.Data.Standardizer.remove_hydrogens_and_sanitize(mol: Mol) Mol | None
Remove explicit hydrogens and sanitize a molecule.
- Parameters:
mol (Chem.Mol) – The RDKit molecule object.
- Returns:
The molecule object with explicit hydrogens removed and sanitized, or None if sanitization fails.
- Return type:
Optional[Chem.Mol]
>>> mol = Chem.MolFromSmiles("CCO")
>>> clean_mol = remove_hydrogens_and_sanitize(mol)
- proqsar.Data.Standardizer.salts_remover(mol: Mol) Mol
Remove salt fragments from a molecule using RDKit’s SaltRemover.
- Parameters:
mol (Chem.Mol) – The RDKit molecule object.
- Returns:
The molecule object with salts removed.
- Return type:
Chem.Mol
>>> mol = Chem.MolFromSmiles("CCO.Na")
>>> desalted = salts_remover(mol)
- proqsar.Data.Standardizer.uncharge_molecule(mol: Mol) Mol
Neutralize a molecule by removing charges using RDKit’s Uncharger.
- Parameters:
mol (Chem.Mol) – The RDKit molecule object.
- Returns:
The neutralized molecule object.
- Return type:
Chem.Mol
>>> mol = Chem.MolFromSmiles("CC[NH3+].[Cl-]")
>>> uncharged = uncharge_molecule(mol)
Featurizer
Splitter
- class proqsar.Data.Splitter.data_splitter.Splitter(activity_col: str = 'activity', smiles_col: str = 'SMILES', mol_col: str = 'mol', option: str = 'random', test_size: float = 0.2, n_splits: int = 5, cutoff: float = 0.35, random_state: int = 42, save_dir: str | None = 'Project/Splitter', data_name: str | None = None, deactivate: bool = False)
Bases: BaseEstimator
Unified interface for dataset partitioning into train/test subsets.
This class provides a common interface to perform different dataset splitting strategies, such as random splitting, stratified random splitting, scaffold-based splitting, and scaffold-based stratified splitting. It also handles optional saving of train/test splits to disk.
- Parameters:
activity_col (str) – Name of the column representing the activity or target label. Default is "activity".
smiles_col (str) – Name of the column containing SMILES strings for molecular data. Default is "SMILES".
mol_col (str) – Name of the column containing RDKit Mol objects (if available). Default is "mol".
option (str) – Splitting method, one of "random", "stratified_random", "scaffold", "random_scaffold", "stratified_scaffold", "butina". Default is "random".
test_size (float) – Proportion of the dataset to include in the test split. Default is 0.2.
n_splits (int) – Number of folds for stratified splitting (used only when option is "stratified_scaffold"). Default is 5.
cutoff (float) – Tanimoto distance cutoff used in "butina" splitting. Lower values produce smaller, finer clusters (stricter similarity), while higher values produce larger, coarser clusters. Only used when option="butina". Default is 0.35.
random_state (int) – Random seed used by the random number generator. Default is 42.
save_dir (Optional[str]) – Directory where train/test CSV files are saved. If None, no files are written. Default is "Project/Splitter".
data_name (Optional[str]) – Optional name suffix for saved train/test files. Default is None.
deactivate (bool) – If True, disables splitting and returns the full dataset as training set. Default is False.
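To illustrate how a Tanimoto distance cutoff changes cluster granularity, here is a minimal pure-Python sketch of a Butina-style clustering. It is not the library's code: fingerprints are modelled as sets of on-bit indices, and the function names `tanimoto_distance` and `butina_like_clusters` are hypothetical.

```python
from typing import FrozenSet, List


def tanimoto_distance(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """1 - |a & b| / |a | b|; two empty fingerprints count as identical."""
    union = len(a | b)
    if union == 0:
        return 0.0
    return 1.0 - len(a & b) / union


def butina_like_clusters(fps: List[FrozenSet[int]], cutoff: float) -> List[List[int]]:
    """Greedy clustering: the item with the most neighbours seeds a cluster first."""
    n = len(fps)
    neighbors = [
        [j for j in range(n)
         if j != i and tanimoto_distance(fps[i], fps[j]) <= cutoff]
        for i in range(n)
    ]
    # Visit candidates from most to fewest neighbours (stable for ties).
    order = sorted(range(n), key=lambda i: len(neighbors[i]), reverse=True)
    assigned = set()
    clusters = []
    for i in order:
        if i in assigned:
            continue
        members = [i] + [j for j in neighbors[i] if j not in assigned]
        assigned.update(members)
        clusters.append(sorted(members))
    return clusters


fps = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3, 5}),
    frozenset({1, 2, 6, 7}),
    frozenset({8, 9, 10}),
]
tight = butina_like_clusters(fps, cutoff=0.35)
loose = butina_like_clusters(fps, cutoff=0.7)
print(len(tight), len(loose))
```

With the strict cutoff every fingerprint ends up in its own cluster, while the looser cutoff merges the first three, which matches the "finer versus coarser" behaviour described for the parameter.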
Example
    import pandas as pd
    from proqsar.Data.Splitter.data_splitter import Splitter

    df = pd.DataFrame({
        "id": [1, 2, 3, 4, 5],
        "SMILES": ["CCO", "CCC", "CCN", "CCCl", "CCBr"],
        "activity": [1.2, 3.4, 2.1, 0.5, 4.7]
    })
    splitter = Splitter(option="random", test_size=0.4, random_state=0)
    train, test = splitter.fit(df)
    print(train.shape, test.shape)  # (3, 2) (2, 2)
- fit(data: DataFrame | List[Dict]) Tuple[DataFrame, DataFrame | None]
Split the dataset into training and testing sets.
- Parameters:
data (pd.DataFrame) – Input dataset containing at least the activity column and SMILES column.
- Returns:
A tuple (train_df, test_df) where:
- train_df is the training dataset (with SMILES and Mol columns dropped).
- test_df is the testing dataset (with SMILES and Mol columns dropped), or None if deactivate=True.
- Return type:
Tuple[pd.DataFrame, Optional[pd.DataFrame]]
- Raises:
ValueError – If an invalid splitting option is provided.
Exception – If an unexpected error occurs during splitting.
- set_fit_request(*, data: bool | None | str = '$UNCHANGED$') Splitter
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
Parameters
- data : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for data parameter in fit.
Returns
- self : object
The updated object.
Preprocessor
Data Cleaning
- class proqsar.Preprocessor.Clean.duplicate_handler.DuplicateHandler(activity_col: str | None = None, id_col: str | None = None, cols: bool = True, rows: bool = True, keep: str = 'mean', random_state: int | None = 42, save_method: bool = False, save_dir: str = 'Project/DuplicateHandler', save_trans_data: bool = False, trans_data_name: str = 'trans_data', deactivate: bool = False)
Bases: BaseEstimator, TransformerMixin
A preprocessing transformer to detect and remove duplicate columns and rows in a pandas DataFrame.
Duplicate columns are removed (if cols=True) using exact column equality.
Duplicate rows are consolidated (if rows=True) based on all feature columns (i.e., all columns except id_col, activity_col, and any removed dup columns). The consolidation strategy for the activity column is controlled by keep:
‘first’ : keep the first occurrence as-is
‘last’ : keep the last occurrence as-is
‘random’ : keep a random occurrence (requires random_state for determinism)
‘min’ : keep the row with minimum activity
‘max’ : keep the row with maximum activity
‘mean’ : collapse duplicates and set activity to the mean
‘median’ : collapse duplicates and set activity to the median
For ‘mean’ / ‘median’, the first row of the group is retained and its activity value is replaced by the aggregated statistic.
Supports saving the fitted handler and transformed data for reproducibility.
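The ‘mean’ / ‘median’ consolidation described above can be sketched in plain Python. This is an illustrative assumption of the behaviour, not the handler's code; rows are modelled as dictionaries and the helper name `consolidate` is hypothetical.

```python
from statistics import mean, median
from typing import Dict, List, Sequence


def consolidate(rows: List[Dict], feature_keys: Sequence[str],
                activity_key: str, keep: str = "mean") -> List[Dict]:
    """Collapse rows with identical feature values into one row per group."""
    aggregate = {"mean": mean, "median": median}[keep]
    groups: Dict[tuple, List[Dict]] = {}
    for row in rows:
        # Rows are duplicates when all feature columns match; id and
        # activity are deliberately left out of the grouping key.
        key = tuple(row[k] for k in feature_keys)
        groups.setdefault(key, []).append(row)
    result = []
    for members in groups.values():
        first = dict(members[0])  # retain the first row of the group
        first[activity_key] = aggregate([m[activity_key] for m in members])
        result.append(first)
    return result


rows = [
    {"id": 1, "f1": 0, "f2": 1, "activity": 2.0},
    {"id": 2, "f1": 0, "f2": 1, "activity": 4.0},
    {"id": 3, "f1": 1, "f2": 1, "activity": 5.0},
]
deduplicated = consolidate(rows, feature_keys=["f1", "f2"], activity_key="activity")
print(deduplicated)
```

Rows 1 and 2 share the same features, so only the first survives and its activity becomes the group mean; row 3 is unique and passes through unchanged.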
- fit(data: DataFrame, y=None) DuplicateHandler
Fit the handler by identifying duplicate columns.
- Parameters:
data (pandas.DataFrame) – Input DataFrame to inspect for duplicate columns.
y (Optional[pandas.Series]) – Ignored. Present for sklearn compatibility.
- Returns:
The fitted DuplicateHandler instance.
- Return type:
DuplicateHandler
- Raises:
Exception – If an unexpected error occurs during fitting.
- fit_transform(data: DataFrame, y=None) DataFrame
Fit the handler and then transform the data.
- Parameters:
data (pandas.DataFrame) – Input DataFrame to fit and transform.
y (Optional[pandas.Series]) – Ignored. Present for sklearn compatibility.
- Returns:
Transformed DataFrame with duplicates removed.
- Return type:
pandas.DataFrame
- set_fit_request(*, data: bool | None | str = '$UNCHANGED$') DuplicateHandler
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
Parameters
- data : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for data parameter in fit.
Returns
- self : object
The updated object.
- set_transform_request(*, data: bool | None | str = '$UNCHANGED$') DuplicateHandler
Request metadata passed to the transform method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
Parameters
- data : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for data parameter in transform.
Returns
- self : object
The updated object.
- transform(data: DataFrame) DataFrame
Transform the DataFrame by removing duplicate rows and columns.
- Parameters:
data (pandas.DataFrame) – Input DataFrame to transform.
- Returns:
Transformed DataFrame with duplicates removed.
- Return type:
pandas.DataFrame
- Raises:
ValueError – If a required column is missing.
Exception – For any unexpected error during transformation.
- class proqsar.Preprocessor.Clean.low_variance_handler.LowVarianceHandler(activity_col: str | None = None, id_col: str | None = None, var_thresh: float = 0.05, save_method: bool = False, visualize: bool = False, save_image: bool = False, image_name: str = 'variance_analysis', save_dir: str = 'Project/LowVarianceHandler', save_trans_data: bool = False, trans_data_name: str = 'trans_data', deactivate: bool = False)
Bases: BaseEstimator, TransformerMixin
Preprocessing transformer that removes low-variance features.
Features with variance below a specified threshold are dropped. Supports visualization, saving fitted objects, and saving transformed data.
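As a rough illustration of variance-based selection, the sketch below keeps only columns whose variance reaches the threshold. It assumes population variance and a "greater than or equal" comparison, neither of which is confirmed by this reference; the function name `select_by_variance` and the dict-of-lists data layout are hypothetical.

```python
from statistics import pvariance
from typing import Dict, List, Sequence


def select_by_variance(columns: Dict[str, List[float]], var_thresh: float = 0.05,
                       exclude: Sequence[str] = ()) -> List[str]:
    """Names of feature columns whose population variance is >= var_thresh."""
    return [
        name
        for name, values in columns.items()
        # Identifier and activity columns are never candidates for removal.
        if name not in exclude and pvariance(values) >= var_thresh
    ]


columns = {
    "id": [1, 2, 3, 4],
    "nearly_constant": [1.0, 1.0, 1.0, 1.01],
    "spread": [0.0, 1.0, 2.0, 3.0],
    "activity": [5.1, 6.3, 4.8, 7.0],
}
selected = select_by_variance(columns, var_thresh=0.05, exclude=("id", "activity"))
print(selected)
```

Here `nearly_constant` has a variance far below 0.05 and is dropped, while `spread` (variance 1.25) is retained.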
- fit(data: DataFrame, y=None) LowVarianceHandler
Fit the handler by determining which features exceed the variance threshold.
- Parameters:
data (pandas.DataFrame) – Input DataFrame to fit on.
y (Optional[pandas.Series]) – Ignored. Present for sklearn compatibility.
- Returns:
The fitted LowVarianceHandler instance.
- Return type:
LowVarianceHandler
- fit_transform(data: DataFrame, y=None) DataFrame
Fit the handler and transform the data.
- Parameters:
data (pandas.DataFrame) – Input DataFrame.
y (Optional[pandas.Series]) – Ignored. Present for sklearn compatibility.
- Returns:
Transformed DataFrame with selected features retained.
- Return type:
pandas.DataFrame
- static select_features_by_variance(data: DataFrame, activity_col: str | None = None, id_col: str | None = None, var_thresh: float = 0.05) list
Select features that pass the variance threshold.
- Parameters:
data (pandas.DataFrame) – Input DataFrame.
activity_col (Optional[str]) – Activity column to exclude from selection.
id_col (Optional[str]) – ID column to exclude from selection.
var_thresh (float) – Minimum variance required to retain a feature.
- Returns:
List of selected feature names.
- Return type:
list
- Raises:
Exception – If variance selection fails.
- set_fit_request(*, data: bool | None | str = '$UNCHANGED$') LowVarianceHandler
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
Parameters
- data : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for data parameter in fit.
Returns
- self : object
The updated object.
- set_transform_request(*, data: bool | None | str = '$UNCHANGED$') LowVarianceHandler
Request metadata passed to the transform method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
Parameters
- data : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for data parameter in transform.
Returns
- self : object
The updated object.
- transform(data: DataFrame) DataFrame
Transform the data by keeping only selected features.
- Parameters:
data (pandas.DataFrame) – Input DataFrame to transform.
- Returns:
Transformed DataFrame with only retained features.
- Return type:
pandas.DataFrame
- Raises:
NotFittedError – If called before fit.
Exception – For unexpected errors during transformation.
- static variance_threshold_analysis(data: DataFrame, activity_col: str | None = None, id_col: str | None = None, set_style: str = 'whitegrid', save_image: bool = False, image_name: str = 'variance_analysis', save_dir: str = 'Project/VarianceHandler') None
Perform variance threshold analysis on non-binary features and plot retained feature counts as threshold increases.
- Parameters:
data (pandas.DataFrame) – Input DataFrame.
activity_col (Optional[str]) – Activity column to exclude from analysis.
id_col (Optional[str]) – ID column to exclude from analysis.
set_style (str) – Seaborn plot style (default “whitegrid”).
save_image (bool) – Whether to save the plot as an image.
image_name (str) – Base filename for saved image.
save_dir (str) – Directory to save plot if save_image=True.
- Returns:
None
- Return type:
None
- Raises:
Exception – If variance analysis fails.
- class proqsar.Preprocessor.Clean.missing_handler.MissingHandler(activity_col: str | None = None, id_col: str | None = None, missing_thresh: float = 40.0, imputation_strategy: str = 'mean', n_neighbors: int = 5, save_method: bool = False, save_dir: str | None = 'Project/MissingHandler', save_trans_data: bool = False, trans_data_name: str = 'trans_data', deactivate: bool = False)
Bases: BaseEstimator, TransformerMixin
Handle missing values by:
dropping columns with too many missing values,
imputing binary and non-binary columns separately,
supporting multiple imputation strategies.
Supports saving fitted imputers and transformed data for reproducibility.
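The drop-then-impute idea can be sketched with the standard library alone. This is a simplified assumption of the workflow, not the handler itself: missing values are modelled as None, a column is dropped when its missing percentage is strictly above the threshold, only mean imputation is shown, and the names `missing_percent` and `drop_and_impute` are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Optional

Column = List[Optional[float]]


def missing_percent(values: Column) -> float:
    """Percentage (0-100) of entries that are None."""
    return 100.0 * sum(v is None for v in values) / len(values)


def drop_and_impute(columns: Dict[str, Column],
                    missing_thresh: float = 40.0) -> Dict[str, List[float]]:
    """Drop sparse columns, then fill remaining gaps with the column mean."""
    result = {}
    for name, values in columns.items():
        if missing_percent(values) > missing_thresh:
            continue  # too many gaps to impute reliably
        fill = mean([v for v in values if v is not None])
        result[name] = [fill if v is None else v for v in values]
    return result


columns = {
    "mostly_present": [1.0, None, 3.0, 5.0],   # 25% missing
    "mostly_missing": [None, None, None, 1.0],  # 75% missing
}
imputed = drop_and_impute(columns, missing_thresh=40.0)
print(imputed)
```

The default threshold of 40.0 in the class signature is read here as a percentage, which is consistent with calculate_missing_percent returning percentages.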
- static calculate_missing_percent(data: DataFrame) DataFrame
Compute percentage of missing values per column.
- Parameters:
data (pandas.DataFrame) – Input DataFrame.
- Returns:
DataFrame with columns ["ColumnName", "MissingPercent"].
- Return type:
pandas.DataFrame
- fit(data: DataFrame, y=None) MissingHandler
Fit imputers to the dataset.
- Parameters:
data (pandas.DataFrame) – Input DataFrame to fit on.
y (Optional[pandas.Series]) – Ignored, present for sklearn compatibility.
- Returns:
Fitted handler.
- Return type:
MissingHandler
- Raises:
Exception – For unexpected fitting errors.
- fit_transform(data: DataFrame, y=None) DataFrame
Fit imputers and transform the dataset in one step.
- Parameters:
data (pandas.DataFrame) – Input DataFrame.
y (Optional[pandas.Series]) – Ignored, present for sklearn compatibility.
- Returns:
Transformed DataFrame with imputed values.
- Return type:
pandas.DataFrame
- set_fit_request(*, data: bool | None | str = '$UNCHANGED$') MissingHandler
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
Parameters
- data : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for data parameter in fit.
Returns
- self : object
The updated object.
- set_transform_request(*, data: bool | None | str = '$UNCHANGED$') MissingHandler
Request metadata passed to the transform method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
Parameters
- data : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for data parameter in transform.
Returns
- self : object
The updated object.
- transform(data: DataFrame) DataFrame
Impute missing values using fitted imputers.
- Parameters:
data (pandas.DataFrame) – Input DataFrame to transform.
- Returns:
Transformed DataFrame with missing values imputed.
- Return type:
pandas.DataFrame
- Raises:
NotFittedError – If called before fit.
Exception – For unexpected transformation errors.
- class proqsar.Preprocessor.Clean.rescaler.Rescaler(activity_col: str | None = None, id_col: str | None = None, select_method: str = 'MinMaxScaler', save_method: bool = False, save_dir: str | None = 'Project/Rescaler', save_trans_data: bool = False, trans_data_name: str = 'trans_data', deactivate: bool = False)
Bases: BaseEstimator, TransformerMixin
Rescale (normalize or standardize) numerical columns in a dataset.
This class provides scaling methods such as Min-Max scaling, Standard scaling, and Robust scaling. It excludes identifier and activity columns, automatically detects non-binary columns for scaling, and optionally saves both the fitted scaler and transformed data.
- Parameters:
activity_col (Optional[str]) – Column name containing activity labels to exclude from scaling.
id_col (Optional[str]) – Column name containing unique identifiers to exclude from scaling.
select_method (str) – Scaling method to use. Options are "MinMaxScaler", "StandardScaler", "RobustScaler", or "None". Default is "MinMaxScaler".
save_method (bool) – Whether to save the fitted rescaler model after fitting. Default is False.
save_dir (Optional[str]) – Directory where the rescaler model and transformed data will be saved. Default is "Project/Rescaler".
save_trans_data (bool) – Whether to save the transformed data as a CSV file. Default is False.
trans_data_name (str) – Base name for the transformed data file. Default is "trans_data".
deactivate (bool) – If True, disables scaling and returns unmodified data. Default is False.
Example
    import pandas as pd
    from proqsar.Preprocessor.Clean.rescaler import Rescaler

    df = pd.DataFrame({
        "id": [1, 2, 3],
        "feature1": [0.1, 0.5, 0.9],
        "feature2": [10, 20, 30],
        "activity": [1.2, 3.4, 2.1]
    })
    rescaler = Rescaler(activity_col="activity", id_col="id", select_method="StandardScaler")
    df_scaled = rescaler.fit_transform(df)
    print(df_scaled)
- fit(data: DataFrame, y=None) Rescaler
Fit the rescaler on the dataset.
Non-binary columns (not exclusively 0/1) are detected and used for fitting the scaler.
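The binary-column detection mentioned above can be pictured as follows. This is a hedged standard-library sketch of the idea using min-max scaling only; the helper names `is_binary`, `min_max_scale`, and `rescale_non_binary` are hypothetical, and the real class delegates to the selected scikit-learn scaler.

```python
from typing import Dict, List, Sequence


def is_binary(values: List[float]) -> bool:
    """True when the column contains nothing but 0 and 1."""
    return set(values) <= {0, 1}


def min_max_scale(values: List[float]) -> List[float]:
    """Map values linearly onto [0, 1]; constant columns become all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rescale_non_binary(columns: Dict[str, List[float]],
                       exclude: Sequence[str] = ()) -> Dict[str, List[float]]:
    """Scale only non-binary feature columns; leave everything else untouched."""
    return {
        name: values if name in exclude or is_binary(values) else min_max_scale(values)
        for name, values in columns.items()
    }


columns = {
    "id": [1, 2, 3],
    "fingerprint_bit": [0, 1, 1],
    "feature2": [10, 20, 30],
    "activity": [1.2, 3.4, 2.1],
}
scaled = rescale_non_binary(columns, exclude=("id", "activity"))
print(scaled)
```

Skipping binary columns keeps fingerprint bits interpretable as 0/1 while continuous descriptors are brought onto a common range.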
- fit_transform(data: DataFrame, y=None) DataFrame
Fit to data, then transform it.
- Parameters:
data (pd.DataFrame) – Dataset to fit and transform.
y (any, optional) – Ignored, included for compatibility with scikit-learn pipelines.
- Returns:
Transformed dataset.
- Return type:
pd.DataFrame
- set_fit_request(*, data: bool | None | str = '$UNCHANGED$') Rescaler
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
Parameters
- data : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for data parameter in fit.
Returns
- self : object
The updated object.
- set_transform_request(*, data: bool | None | str = '$UNCHANGED$') Rescaler
Request metadata passed to the transform method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
Parameters
- data : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for data parameter in transform.
Returns
- self : object
The updated object.
- setting(**kwargs)
Update settings of the Rescaler object.
- transform(data: DataFrame) DataFrame
Transform the dataset using the fitted rescaler.
- Parameters:
data (pd.DataFrame) – Dataset to transform.
- Returns:
Transformed dataset.
- Return type:
pd.DataFrame
- Raises:
NotFittedError – If the rescaler has not been fitted yet.
Exception – If an error occurs during transformation.
Outlier handling
- class proqsar.Preprocessor.Outlier.univariate_outliers.IQRHandler
Bases: object
Handler that removes rows containing univariate outliers based on IQR thresholds.
- Typical use:
handler = IQRHandler()
handler.fit(df)
df_clean = handler.transform(df)
- Variables:
iqr_thresholds (Optional[Dict[str, Dict[str, float]]]) – dictionary of thresholds created during fit.
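A per-column threshold dictionary of this shape could be built as in the sketch below. It assumes the conventional fences Q1 - 1.5·IQR and Q3 + 1.5·IQR; the multiplier, the inner keys "low" and "high", and the helper names are assumptions, not taken from the library.

```python
from statistics import quantiles
from typing import Dict, List


def iqr_thresholds(values: List[float], k: float = 1.5) -> Dict[str, float]:
    """Lower and upper fences for one column (k=1.5 is an assumed default)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {"low": q1 - k * iqr, "high": q3 + k * iqr}


def drop_outlier_rows(rows: List[Dict[str, float]],
                      thresholds: Dict[str, Dict[str, float]]) -> List[Dict[str, float]]:
    """Keep rows whose value in every thresholded column lies within its fences."""
    return [
        row for row in rows
        if all(t["low"] <= row[col] <= t["high"] for col, t in thresholds.items())
    ]


rows = [{"x": 1.0}, {"x": 2.0}, {"x": 3.0}, {"x": 4.0}, {"x": 100.0}]
# "Fit": compute thresholds from the data; "transform": filter rows with them.
thresholds = {"x": iqr_thresholds([r["x"] for r in rows])}
clean_rows = drop_outlier_rows(rows, thresholds)
print(thresholds, len(clean_rows))
```

Storing the fences at fit time means the same limits can later be applied to unseen data rather than being recomputed from it.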
- fit(data: DataFrame) IQRHandler
Compute IQR thresholds from the provided data.
- Parameters:
data (pd.DataFrame) – DataFrame used to compute thresholds.
- Returns:
self (fitted handler).
- Return type:
IQRHandler
- class proqsar.Preprocessor.Outlier.univariate_outliers.ImputationHandler(missing_thresh: float = 40.0, imputation_strategy: str = 'mean', n_neighbors: int = 5)
Bases: object
Handler that marks univariate outliers as NaN (based on IQR) and imputes them using MissingHandler.
- Typical use:
ih = ImputationHandler(missing_thresh=40.0, imputation_strategy='mean')
ih.fit(df)
df_imputed = ih.transform(df)
- Parameters:
- Variables:
iqr_thresholds (Optional[Dict[str, Dict[str, float]]]) – thresholds used to mark outliers as NaN.
imputation_handler (Optional[MissingHandler]) – fitted MissingHandler instance.
- fit(data: DataFrame) -> ImputationHandler
Compute IQR thresholds and fit a MissingHandler on the NaN-marked data.
- Parameters:
data (pd.DataFrame) – DataFrame used to compute thresholds and to fit imputer.
- Returns:
self (fitted ImputationHandler).
- Return type:
ImputationHandler
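The two-step idea behind ImputationHandler (mark outliers as NaN, then impute) can be sketched as follows. This is only an approximation under stated assumptions: scikit-learn's `SimpleImputer` stands in for proqsar's MissingHandler, and the 1.5 × IQR fences are an assumed convention rather than something this reference confirms.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 100.0]})

# Assumed 1.5 * IQR fences (the multiplier is not stated in the docs).
q1, q3 = df["x"].quantile(0.25), df["x"].quantile(0.75)
low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)

# Step 1: mark outliers as NaN. Series.where keeps values where the
# condition holds and writes NaN elsewhere.
masked = df.copy()
masked["x"] = df["x"].where(df["x"].between(low, high))

# Step 2: impute the NaN-marked data. SimpleImputer is a stand-in here.
imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(
    imputer.fit_transform(masked), columns=masked.columns, index=masked.index
)
print(imputed)
```

Unlike row removal, this keeps the row count unchanged, which matters when rows must stay aligned with an identifier or activity column.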
- class proqsar.Preprocessor.Outlier.univariate_outliers.UnivariateOutliersHandler(activity_col: str | None = None, id_col: str | None = None, select_method: str = 'uniform', imputation_strategy: str = 'mean', missing_thresh: float = 40.0, n_neighbors: int = 5, save_method: bool = False, save_dir: str | None = 'Project/OutlierHandler', save_trans_data: bool = False, trans_data_name: str = 'trans_data', deactivate: bool = False)
Bases:
BaseEstimator, TransformerMixin
High-level univariate outlier handler.
This class detects features with univariate outliers (via _feature_quality) and applies one of several handling strategies only to those features:
'iqr' : remove rows outside IQR thresholds (IQRHandler)
'winsorization' : cap values at thresholds (WinsorHandler)
'imputation' : set outliers to NaN and impute (ImputationHandler)
'power' : PowerTransformer()
'normal' : QuantileTransformer(output_distribution='normal')
'uniform' : QuantileTransformer(output_distribution='uniform')
- Typical usage:
uoh = UnivariateOutliersHandler(select_method='iqr', id_col='id')
uoh.fit(df)
df_out = uoh.transform(df)
- Parameters:
activity_col (Optional[str]) – Optional column name for activity/target to exclude from detection.
id_col (Optional[str]) – Optional column name for identifiers to exclude from detection.
select_method (str) – Chosen method key (one of the supported methods).
imputation_strategy (str) – Strategy forwarded to ImputationHandler when used.
missing_thresh (float) – Missing percent threshold forwarded to ImputationHandler.
n_neighbors (int) – KNN neighbors forwarded to ImputationHandler when used.
save_method (bool) – If True, saves the fitted handler as a pickle in save_dir.
save_dir (Optional[str]) – Directory used for saving pickles/CSVs.
save_trans_data (bool) – If True, transformed data is saved to CSV.
trans_data_name (str) – Filename base for saving transformed CSV.
deactivate (bool) – If True, fit/transform become no-ops and input is returned unchanged.
- static compare_univariate_methods(data1: DataFrame, data2: DataFrame | None = None, data1_name: str = 'data1', data2_name: str = 'data2', activity_col: str | None = None, id_col: str | None = None, methods_to_compare: List[str] = None, save_dir: str | None = 'Project/OutlierHandler') -> DataFrame
Compare a set of univariate outlier handling methods by applying each to data1 and (optionally) data2 and summarizing how many rows remain / are removed.
- Parameters:
data1 (pd.DataFrame) – Primary DataFrame to evaluate methods on.
data2 (Optional[pd.DataFrame]) – Optional secondary DataFrame to evaluate with the same fitted handlers.
data1_name (str) – Label used for dataset1 in the output table.
data2_name (str) – Label used for dataset2 in the output table.
activity_col (Optional[str]) – Optional activity/target column to exclude from detection.
id_col (Optional[str]) – Optional ID column to exclude from detection.
methods_to_compare (List[str]) – List of method keys to compare. Defaults to all supported methods.
save_dir (Optional[str]) – If provided, the comparison table CSV will be saved here.
- Returns:
DataFrame summarizing for each method and dataset the row counts before/after handling.
- Return type:
pd.DataFrame
- Raises:
Exception – Propagates exceptions encountered during comparison.
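To make the shape of such a comparison concrete, here is a minimal sketch that applies two strategies to a primary and a secondary dataset and tabulates row counts. It does not call compare_univariate_methods: the helper functions, the summary column names (`method`, `dataset`, `rows_before`, `rows_after`), and the 1.5 × IQR fences are all hypothetical choices made for illustration.

```python
import pandas as pd


def fit_fences(df: pd.DataFrame, k: float = 1.5):
    """Per-column (low, high) IQR fences; k=1.5 is an assumed convention."""
    q1, q3 = df.quantile(0.25), df.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_rows(df, low, high):
    """'iqr'-style handling: drop rows with any value outside the fences."""
    inside = ((df >= low) & (df <= high)).all(axis=1)
    return df[inside]


def winsorize(df, low, high):
    """'winsorization'-style handling: cap values at the fences."""
    return df.apply(lambda s: s.clip(lower=low[s.name], upper=high[s.name]))


data1 = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 100.0]})
data2 = pd.DataFrame({"x": [2.0, 3.0, 50.0]})

low, high = fit_fences(data1)  # thresholds are learned on data1 only
methods = {"iqr": remove_rows, "winsorization": winsorize}

rows = []
for method, handler in methods.items():
    for name, data in {"data1": data1, "data2": data2}.items():
        after = handler(data, low, high)
        rows.append({
            "method": method,
            "dataset": name,
            "rows_before": len(data),
            "rows_after": len(after),
        })
summary = pd.DataFrame(rows)
print(summary)
```

Only the row-removing strategy changes the row count; capping keeps every row, which is the kind of trade-off a comparison table makes visible.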
- fit(data: DataFrame, y=None) -> UnivariateOutliersHandler
Detect bad features and fit the selected outlier handling strategy.
- Parameters:
data (pd.DataFrame) – Input DataFrame used to detect bad features and to fit the chosen handler.
y (Optional[pd.Series]) – Ignored; present for sklearn compatibility.
- Returns:
self (fitted UnivariateOutliersHandler).
- Return type:
UnivariateOutliersHandler
- Raises:
ValueError – If an unsupported select_method is provided.
Exception – Propagates unexpected exceptions.
- fit_transform(data: DataFrame, y=None) -> DataFrame
Fit the handler and immediately transform the provided data.
- Parameters:
data (pd.DataFrame) – DataFrame to fit & transform.
y (Optional[pd.Series]) – Ignored; present for sklearn compatibility.
- Returns:
Transformed DataFrame.
- Return type:
pd.DataFrame
- set_fit_request(*, data: bool | None | str = '$UNCHANGED$') -> UnivariateOutliersHandler
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
data (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for data parameter in fit.
- Returns:
self (object) – The updated object.
- set_transform_request(*, data: bool | None | str = '$UNCHANGED$') -> UnivariateOutliersHandler
Request metadata passed to the transform method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
data (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for data parameter in transform.
- Returns:
self (object) – The updated object.
- transform(data: DataFrame) -> DataFrame
Apply the fitted outlier handler to the detected bad features.
Only the columns flagged in self.bad are transformed; the rest of the DataFrame is preserved. If the chosen handler returns a numpy array, it is coerced into a DataFrame with the original column names.
- Parameters:
data (pd.DataFrame) – DataFrame to transform.
- Returns:
Transformed DataFrame with outlier handling applied.
- Return type:
pd.DataFrame
- Raises:
NotFittedError – If the handler has not been fitted.
Exception – Propagates unexpected exceptions during transformation.
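The "transform only the flagged columns, then coerce the numpy result back into a DataFrame" behaviour can be sketched with plain scikit-learn. This is an illustration rather than the handler's code: the `bad` list is a hypothetical stand-in for the columns flagged during fit, and `n_quantiles` is set to the sample count only to suit this tiny dataset.

```python
import pandas as pd
from sklearn.preprocessing import QuantileTransformer

df = pd.DataFrame({
    "id": ["a", "b", "c", "d", "e"],
    "feat_ok": [0.1, 0.2, 0.3, 0.4, 0.5],
    "feat_bad": [1.0, 2.0, 3.0, 4.0, 100.0],
})
bad = ["feat_bad"]  # hypothetical: columns flagged as having outliers

# Fit the transformer on the flagged columns only.
qt = QuantileTransformer(
    output_distribution="uniform", n_quantiles=len(df), random_state=42
)
qt.fit(df[bad])

out = df.copy()
# transform() returns a numpy array, so wrap it in a DataFrame to restore
# the original column names and index before writing it back.
out[bad] = pd.DataFrame(qt.transform(df[bad]), columns=bad, index=df.index)
print(out)
```

Columns outside `bad`, including the identifier, pass through untouched, so downstream steps see the same schema as the input.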
- class proqsar.Preprocessor.Outlier.univariate_outliers.WinsorHandler
Bases:
object
Handler that applies Winsorization (capping) using IQR thresholds.
- Typical use:
wh = WinsorHandler()
wh.fit(df)
df_capped = wh.transform(df)
- Variables:
iqr_thresholds (Optional[Dict[str, Dict[str, float]]]) – dictionary of thresholds created during fit.
- fit(data: DataFrame) -> WinsorHandler
Compute and store IQR thresholds.
- Parameters:
data (pd.DataFrame) – DataFrame used to compute thresholds.
- Returns:
self (fitted handler).
- Return type:
WinsorHandler
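Winsorization with IQR thresholds amounts to clipping each value into the fence interval. A minimal sketch using pandas alone is shown below; the 1.5 × IQR multiplier is an assumption, since the reference does not spell out how WinsorHandler derives its thresholds.

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 100.0]})

# Assumed 1.5 * IQR fences.
q1, q3 = df["x"].quantile(0.25), df["x"].quantile(0.75)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap values at the fences instead of removing the rows.
capped = df.assign(x=df["x"].clip(lower=low, upper=high))
print(capped["x"].tolist())
```

Here the extreme value 100.0 is pulled down to the upper fence while all five rows are retained.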
- class proqsar.Preprocessor.Outlier.kbin_handler.KBinHandler(activity_col: str | None = None, id_col: str | None = None, n_bins: int = 3, encode: str = 'ordinal', strategy: str = 'quantile', save_method: bool = False, save_dir: str | None = 'Project/KBinHandler', save_trans_data: bool = False, trans_data_name: str = 'trans_data', deactivate: bool = False)
Bases:
BaseEstimator, TransformerMixin
Discretize features identified as univariate outliers using sklearn.preprocessing.KBinsDiscretizer.
This handler detects “bad” features via _feature_quality(), fits a KBinsDiscretizer, and replaces bad features with binned columns (Kbin1, Kbin2, …).
Typical usage:
>>> kbin = KBinHandler(activity_col="activity", id_col="id", n_bins=3)
>>> kbin.fit(df)
>>> transformed = kbin.transform(df)
- Parameters:
activity_col (Optional[str]) – Name of the activity/target column (if present).
id_col (Optional[str]) – Name of the identifier column (if present).
n_bins (int) – Number of bins to produce. Default is 3.
encode (str) – Encoding strategy {"ordinal", "onehot", "onehot-dense"}. Default is "ordinal".
strategy (str) – Binning strategy {"uniform", "quantile", "kmeans"}. Default is "quantile".
save_method (bool) – If True, save fitted handler as pickle.
save_dir (Optional[str]) – Directory to save pickled handler / CSV outputs. Default is “Project/KBinHandler”.
save_trans_data (bool) – If True, save transformed data to CSV.
trans_data_name (str) – Base filename for saving transformed CSV. Default is “trans_data”.
deactivate (bool) – If True, disable handler and return inputs unchanged.
- Variables:
kbin (Optional[KBinsDiscretizer]) – Fitted KBinsDiscretizer after fit(), or None.
bad (list[str]) – Names of detected univariate outlier features.
transformed_data (pandas.DataFrame) – Stores the last transformed DataFrame.
- fit(data: DataFrame, y=None) -> KBinHandler
Detect univariate outliers and fit KBinsDiscretizer on them.
- Steps:
Call _feature_quality() to detect “bad” features.
If any, fit KBinsDiscretizer on those columns.
Optionally save fitted handler as pickle.
- Parameters:
data (pandas.DataFrame) – Input DataFrame to fit on.
y (Any) – Ignored, present for sklearn compatibility.
- Returns:
Fitted handler (self).
- Return type:
KBinHandler
- Raises:
Exception – If fitting fails unexpectedly.
- fit_transform(data: DataFrame, y=None) -> DataFrame
Fit KBinsDiscretizer on bad features and transform in one call.
- Parameters:
data (pandas.DataFrame) – Input DataFrame to fit and transform.
y (Any) – Ignored, present for sklearn compatibility.
- Returns:
Transformed DataFrame with discretized features.
- Return type:
pandas.DataFrame
- set_fit_request(*, data: bool | None | str = '$UNCHANGED$') -> KBinHandler
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
data (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for data parameter in fit.
- Returns:
self (object) – The updated object.
- set_transform_request(*, data: bool | None | str = '$UNCHANGED$') -> KBinHandler
Request metadata passed to the transform method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
data (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for data parameter in transform.
- Returns:
self (object) – The updated object.
- transform(data: DataFrame) -> DataFrame
Apply fitted KBinsDiscretizer to detected bad features.
If deactivated → return input unchanged.
If no bad features detected → return input unchanged.
Otherwise → replace bad features with new columns (“Kbin1”, “Kbin2”, …).
- Parameters:
data (pandas.DataFrame) – Input DataFrame to transform.
- Returns:
Transformed DataFrame with discretized columns.
- Return type:
pandas.DataFrame
- Raises:
Exception – If transformation fails unexpectedly.
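The replace-with-binned-columns step can be approximated with scikit-learn directly. In this sketch the `bad` list is a hypothetical stand-in for the features that `_feature_quality()` would flag, and appending the new `Kbin` columns at the end of the frame is an assumption about column order, not documented behaviour.

```python
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

df = pd.DataFrame({
    "id": [0, 1, 2, 3, 4, 5],
    "feat_ok": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    "feat_bad": [1.0, 2.0, 3.0, 4.0, 5.0, 100.0],
})
bad = ["feat_bad"]  # hypothetical: features flagged as having outliers

# Same defaults as the documented signature: 3 ordinal quantile bins.
kbin = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
binned = kbin.fit_transform(df[bad])  # numpy array, one column per feature

# Replace the flagged features with Kbin1, Kbin2, ... columns.
new_cols = [f"Kbin{i + 1}" for i in range(binned.shape[1])]
out = pd.concat(
    [df.drop(columns=bad), pd.DataFrame(binned, columns=new_cols, index=df.index)],
    axis=1,
)
print(out)
```

Binning discards the magnitude of the extreme value and keeps only its rank bucket, which is why it works as an outlier treatment.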
- class proqsar.Preprocessor.Outlier.multivariate_outliers.MultivariateOutliersHandler(activity_col: str | None = None, id_col: str | None = None, select_method: str = 'LocalOutlierFactor', n_jobs: int = 1, random_state: int | None = 42, save_method: bool = False, save_dir: str | None = 'Project/MultivOutlierHandler', save_trans_data: bool = False, trans_data_name: str = 'trans_data', deactivate: bool = False)
Bases:
BaseEstimator, TransformerMixin
Detect and remove multivariate outliers from tabular datasets.
The handler supports several algorithms for multivariate outlier detection:
"LocalOutlierFactor"
"IsolationForest"
"OneClassSVM"
"RobustCovariance" (EllipticEnvelope with contamination=0.1)
"EmpiricalCovariance" (EllipticEnvelope with support_fraction=1)
Outliers are identified during fit() and removed during transform().
- Parameters:
activity_col (Optional[str]) – Name of the activity/target column to ignore when fitting.
id_col (Optional[str]) – Name of the identifier column to ignore when fitting.
select_method (str) – Algorithm name to use for detection. One of {"LocalOutlierFactor", "IsolationForest", "OneClassSVM", "RobustCovariance", "EmpiricalCovariance"}.
n_jobs (int) – Number of parallel jobs (where supported). Default is 1.
random_state (Optional[int]) – Random seed for reproducibility where applicable. Default is 42.
save_method (bool) – If True, save the fitted handler as a pickle.
save_dir (Optional[str]) – Directory to store pickled handler / transformed data. Default is “Project/MultivOutlierHandler”.
save_trans_data (bool) – If True, save transformed DataFrame to CSV.
trans_data_name (str) – Base filename for saving transformed CSV.
deactivate (bool) – If True, disables the handler. Methods become no-ops.
- Variables:
multi_outlier_handler (object | None) – The fitted estimator instance, or None.
features (pandas.Index | None) – List of feature column names used in fitting.
data_fit (pandas.DataFrame) – The feature matrix used at fit time.
transformed_data (pandas.DataFrame | None) – Last transformed DataFrame.
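The detect-then-remove flow can be sketched with scikit-learn's `LocalOutlierFactor` in its in-sample mode. This is not the handler's implementation: the toy grid data, `n_neighbors=5`, and the choice to drop the `id` column before fitting are illustrative assumptions.

```python
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

# A 5 x 4 grid of inliers plus one far-away point (the last row).
grid = [(float(i), float(j)) for i in range(5) for j in range(4)]
df = pd.DataFrame(grid + [(50.0, 50.0)], columns=["f1", "f2"])
df.insert(0, "id", range(len(df)))

# Identifier columns are excluded from detection.
features = df.drop(columns=["id"])

lof = LocalOutlierFactor(n_neighbors=5)
labels = lof.fit_predict(features)  # 1 = inlier, -1 = outlier

# Keep only the rows labelled as inliers.
kept = df[labels == 1]
print(len(df), "->", len(kept))
```

The other listed estimators follow the same labelling convention through `predict()`, so the filtering line stays the same when the detector is swapped.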
- static compare_multivariate_methods(data1: DataFrame, data2: DataFrame | None = None, data1_name: str = 'data1', data2_name: str = 'data2', activity_col: str | None = None, id_col: str | None = None, methods_to_compare: List[str] | None = None, save_dir: str | None = 'Project/OutlierHandler') -> DataFrame
Compare multiple outlier detection methods across datasets.
- Parameters:
data1 (pandas.DataFrame) – Primary dataset.
data2 (Optional[pandas.DataFrame]) – Optional second dataset for evaluation.
data1_name (str) – Label for dataset1 in results.
data2_name (str) – Label for dataset2 in results.
activity_col (Optional[str]) – Activity/target column name to exclude.
id_col (Optional[str]) – Identifier column name to exclude.
methods_to_compare (Optional[List[str]]) – List of algorithms to compare. If None, defaults to all.
save_dir (Optional[str]) – If set, saves comparison results CSV to this directory.
- Returns:
Summary table with rows removed for each method and dataset.
- Return type:
pandas.DataFrame
- Raises:
Exception – If comparison fails unexpectedly.
- fit(data: DataFrame, y=None) -> MultivariateOutliersHandler
Fit the selected outlier detector on the given dataset.
- Parameters:
data (pandas.DataFrame) – Input DataFrame containing features and optional id/activity columns.
y (Any) – Ignored (sklearn API compatibility).
- Returns:
Fitted handler (self).
- Return type:
MultivariateOutliersHandler
- Raises:
ValueError – If select_method is not supported.
Exception – If fitting fails unexpectedly.
- fit_transform(data: DataFrame, y=None) -> DataFrame
Fit the outlier detector and immediately transform the data.
- Parameters:
data (pandas.DataFrame) – Input dataset to fit and filter.
y (Any) – Ignored (sklearn API compatibility).
- Returns:
Transformed DataFrame with outliers removed.
- Return type:
pandas.DataFrame
- set_fit_request(*, data: bool | None | str = '$UNCHANGED$') -> MultivariateOutliersHandler
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
data (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for data parameter in fit.
- Returns:
self (object) – The updated object.
- set_transform_request(*, data: bool | None | str = '$UNCHANGED$') -> MultivariateOutliersHandler
Request metadata passed to the transform method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
data (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for data parameter in transform.
- Returns:
self (object) – The updated object.
- transform(data: DataFrame) -> DataFrame
Remove rows flagged as outliers.
For LocalOutlierFactor, supports both in-sample and novelty detection.
For other estimators, uses predict(), where a label of -1 marks an outlier.
- Parameters:
data (pandas.DataFrame) – DataFrame with the same feature columns as used in fit().
- Returns:
DataFrame with outlier rows removed.
- Return type:
pandas.DataFrame
- Raises:
Model
Feature Selector
- class proqsar.Model.FeatureSelector.feature_selector.FeatureSelector(activity_col: str = 'activity', id_col: str = 'id', select_method: str | List[str] | None = None, add_method: dict | None = None, cross_validate: bool = True, save_method: bool = False, save_trans_data: bool = False, trans_data_name: str = 'trans_data', save_dir: str | None = 'Project/FeatureSelector', n_jobs: int = 1, random_state: int | None = 42, deactivate: bool = False, **kwargs)
Bases:
CrossValidationConfig, BaseEstimator
Pipeline component for feature selection.
This class wraps multiple feature-selection strategies and provides an estimator-like interface, making it compatible with scikit-learn pipelines.
- Key behaviors:
If select_method is a list (or None) and cross_validate=True, evaluates candidate selectors with repeated CV and selects the best one based on scoring_target.
If select_method is a string, directly fits the corresponding selector.
Provides fit, transform, fit_transform and set_params methods.
Supports saving fitted models and transformed datasets.
- Parameters:
activity_col (str) – Column name for the target variable. Default is "activity".
id_col (str) – Column name for record identifiers. Default is "id".
select_method (Optional[Union[str, List[str]]]) – Method name or list of method names. If None, all methods are compared.
add_method (Optional[dict]) – Extra methods to add to the method map (name → selector instance).
cross_validate (bool) – If True, compare candidate methods with CV. Default is True.
save_method (bool) – If True, save the fitted FeatureSelector object as pickle. Default is False.
save_trans_data (bool) – If True, save transformed datasets to CSV. Default is False.
trans_data_name (str) – Base filename for transformed datasets. Default is "trans_data".
save_dir (Optional[str]) – Directory for saving models and transformed data. Default is "Project/FeatureSelector".
n_jobs (int) – Number of parallel jobs for supported estimators. Default is 1.
random_state (Optional[int]) – Random seed for reproducibility. Default is 42.
deactivate (bool) – If True, disables feature selection (fit/transform skipped). Default is False.
kwargs – Additional arguments forwarded to CrossValidationConfig.
Example
import pandas as pd
from proqsar.Model.FeatureSelector.feature_selector import FeatureSelector
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "feature1": [0.1, 0.2, 0.3, 0.4, 0.5],
    "feature2": [5, 4, 3, 2, 1],
    "activity": [0, 1, 0, 1, 0]
})
selector = FeatureSelector(
    activity_col="activity",
    id_col="id",
    select_method=["Anova", "MutualInformation"],
    cross_validate=True
)
selector.fit(df)
df_transformed = selector.transform(df)
print(df_transformed.head())
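For readers who want to see what "compare candidate selectors with repeated CV and keep the best" might look like in plain scikit-learn, here is a hedged sketch. It does not use FeatureSelector: the candidate names merely echo the example above, and the downstream `LogisticRegression`, `k=3`, the `f1` metric, and the synthetic data are all assumptions made for illustration.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(
    n_samples=120, n_features=10, n_informative=3, random_state=42
)

# Hypothetical candidate selectors keyed by method name.
candidates = {
    "Anova": SelectKBest(f_classif, k=3),
    "MutualInformation": SelectKBest(
        partial(mutual_info_classif, random_state=42), k=3
    ),
}

# Repeated CV gives a more stable estimate than a single split.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=42)

scores = {}
for name, selector in candidates.items():
    # The selector sits inside the pipeline so it is refit on each fold.
    pipe = make_pipeline(selector, LogisticRegression(max_iter=1000))
    scores[name] = cross_val_score(pipe, X, y, cv=cv, scoring="f1").mean()

best = max(scores, key=scores.get)
print(scores, "best:", best)
```

Keeping the selector inside the pipeline avoids leaking information from the validation folds into the feature ranking.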
- fit(data: DataFrame) -> FeatureSelector
Fit feature selector(s) on the dataset.
- Parameters:
data (pd.DataFrame) – Input DataFrame containing features, id column, and activity column.
- Returns:
Self, with fitted selector and optional CV report.
- Return type:
FeatureSelector
- Raises:
ValueError – If select_method is invalid or not recognized.
AttributeError – If a list of methods is provided without cross_validate=True.
Exception – For unexpected runtime errors.
- fit_transform(data: DataFrame) -> DataFrame
Fit to the dataset, then transform it.
- Parameters:
data (pd.DataFrame) – Input DataFrame.
- Returns:
Transformed dataset.
- Return type:
pd.DataFrame
- set_fit_request(*, data: bool | None | str = '$UNCHANGED$') -> FeatureSelector
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
data (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for data parameter in fit.
- Returns:
self (object) – The updated object.
- set_params(**kwargs) -> FeatureSelector
Update attributes with provided keyword arguments.
- set_transform_request(*, data: bool | None | str = '$UNCHANGED$') -> FeatureSelector
Request metadata passed to the transform method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
data (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for data parameter in transform.
- Returns:
self (object) – The updated object.
- transform(data: DataFrame) -> DataFrame
Transform dataset using the fitted selector.
- Parameters:
data (pd.DataFrame) – Input DataFrame to transform.
- Returns:
Transformed DataFrame with selected features and preserved id/activity columns.
- Return type:
pd.DataFrame
- Raises:
NotFittedError – If fit has not been called before.
Exception – For unexpected runtime errors.
Model Developer
- class proqsar.Model.ModelDeveloper.model_developer.ModelDeveloper(activity_col: str = 'activity', id_col: str = 'id', select_model: str | List[str] | None = None, add_model: dict = {}, cross_validate: bool = True, save_model: bool = False, save_pred_result: bool = False, pred_result_name: str = 'pred_result', save_dir: str | None = 'Project/ModelDeveloper', n_jobs: int = 1, random_state: int | None = 42, **kwargs)
Bases:
CrossValidationConfig, BaseEstimator
Wrapper for model selection, cross-validated evaluation, model fitting and prediction.
- This class:
infers the task type (classification/regression) from the data,
constructs a default model map (mergeable with add_model),
optionally cross-validates candidate models and selects the best one,
fits the selected model on the full provided dataset,
exposes predict to create a predictions DataFrame,
optionally saves the fitted ModelDeveloper instance or prediction results.
- Parameters:
activity_col (str) – Column name for the target variable.
id_col (str) – Column name for the identifier column.
select_model (Optional[Union[str, List[str]]]) – Name of the model to use or a list of candidate names to evaluate. If None and cross_validate=True, all models in the map are compared.
add_model (dict) – Additional models to include in the model map (name -> estimator or (estimator, …)).
cross_validate (bool) – Whether to run cross-validation to select among candidate models.
save_model (bool) – If True, save the fitted ModelDeveloper object (pickle) to save_dir.
save_pred_result (bool) – If True, save prediction results to CSV when predict is called.
pred_result_name (str) – Filename (without directory) for saved prediction results.
save_dir (Optional[str]) – Directory for saving model/prediction files.
n_jobs (int) – Number of parallel jobs passed to underlying estimators.
random_state (Optional[int]) – Random seed for reproducible estimators.
kwargs (dict) – Forwarded to CrossValidationConfig for CV-related parameters (e.g., n_splits, scoring_target, scoring_list).
- fit(data: DataFrame) -> ModelDeveloper
Fit (or select and fit) the model on the provided dataset.
- Behavior:
Infers task type and CV strategy.
Builds the model map merged with add_model.
If select_model is None or a list and cross_validate is True, runs cross-validation to select the best model and fits it on full data.
If select_model is a string, fits that model directly and optionally runs CV.
Saves the fitted ModelDeveloper instance if save_model is True.
- Parameters:
data (pd.DataFrame) – DataFrame containing features and the activity/id columns.
- Returns:
The fitted ModelDeveloper instance.
- Return type:
ModelDeveloper
- Raises:
Exception – Any unexpected exception is logged and re-raised.
- predict(data: DataFrame) -> DataFrame
Generate predictions for the provided data using the fitted model.
The method returns a DataFrame that always contains the id column and a ‘Predicted value’ column, and includes the true activity values if available. For classification tasks, probability columns for each class are also included.
- Parameters:
data (pd.DataFrame) – DataFrame containing features and id/activity columns.
- Returns:
DataFrame with prediction results; it is also saved to CSV if save_pred_result is True.
- Return type:
pd.DataFrame
- Raises:
NotFittedError – If fit has not been called and the internal model is not present.
Exception – Any unexpected exception is logged and re-raised.
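The select-fit-predict flow described for ModelDeveloper can be approximated with scikit-learn as below. This is a sketch under stated assumptions, not the class's implementation: the task is fixed to classification rather than inferred, the two candidate models and the `f1` metric are arbitrary, and the probability column names are hypothetical. Only the "Predicted value" column name comes from the description above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy dataset laid out like the documented input: id, features, activity.
X, y = make_classification(n_samples=100, n_features=5, random_state=42)
df = pd.DataFrame(X, columns=[f"f{i}" for i in range(5)])
df.insert(0, "id", range(len(df)))
df["activity"] = y

features = df.drop(columns=["id", "activity"])
target = df["activity"]

# Hypothetical model map; the real default map is not listed in the docs.
models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForestClassifier": RandomForestClassifier(
        n_estimators=50, random_state=42
    ),
}

# Cross-validate each candidate, then refit the best one on the full data.
cv_scores = {
    name: cross_val_score(model, features, target, cv=5, scoring="f1").mean()
    for name, model in models.items()
}
best_name = max(cv_scores, key=cv_scores.get)
best = models[best_name].fit(features, target)

# Build the prediction table: id, true activity, prediction, probabilities.
pred = pd.DataFrame({
    "id": df["id"],
    "activity": target,
    "Predicted value": best.predict(features),
})
proba = best.predict_proba(features)
for i, cls in enumerate(best.classes_):
    pred[f"Probability for class {cls}"] = proba[:, i]  # hypothetical name
print(best_name)
print(pred.head())
```

Refitting on the full dataset after selection uses all available rows for the final model, while the cross-validated scores remain the estimate of its performance.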
- set_fit_request(*, data: bool | None | str = '$UNCHANGED$') -> ModelDeveloper
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
data (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for data parameter in fit.
- Returns:
self (object) – The updated object.
- set_params(**kwargs)
Update attributes of the ModelDeveloper instance.
Only existing attributes may be updated; unknown keys raise KeyError. Returns self to allow fluent chaining.
- set_predict_request(*, data: bool | None | str = '$UNCHANGED$') -> ModelDeveloper
Request metadata passed to the predict method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
data (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for data parameter in predict.
- Returns:
self (object) – The updated object.
Optimizer
- class proqsar.Model.Optimizer.optimizer.Optimizer(activity_col: str = 'activity', id_col: str = 'id', select_model: List[str] | None = None, scoring: str | None = None, param_ranges: Dict[str, Dict[str, Any]] = {}, add_model: Dict[str, Tuple[Any, Dict[str, Any]]] = {}, n_trials: int = 50, n_splits: int = 5, n_repeats: int = 2, n_jobs: int = 1, random_state: int = 42, study_name: str = 'my_study', deactivate: bool = False)
Bases: BaseEstimator
Optimize hyperparameters for one or more candidate models using Optuna.
- The Optimizer supports:
specifying which models to search over (select_model),
custom parameter ranges for each model (param_ranges),
adding custom models (add_model: mapping model_name -> (estimator, param_ranges)),
repeated cross-validation for robust scoring,
retrieving best parameters and score after optimization.
- Parameters:
activity_col (str) – Column name for the target variable (default: "activity").
id_col (str) – Column name for the identifier column (default: "id").
select_model (list[str] | None) – Optional list of model names to evaluate. If None, the default model list for the detected task will be used.
scoring (str | None) – Scoring metric name used by sklearn (e.g., 'f1', 'r2'). If None, defaults to 'f1' for classification and 'r2' for regression.
param_ranges (dict) – Mapping model_name -> parameter ranges used by the trial sampler. Example: {"RandomForestClassifier": {"n_estimators": (50, 200)}}.
add_model (dict) – Mapping of custom models to add. Expected format: {name: (estimator_instance, param_range_dict)}.
n_trials (int) – Number of Optuna trials to run (default: 50).
n_splits (int) – Number of CV folds (default: 5).
n_repeats (int) – Number of CV repeats (default: 2).
n_jobs (int) – Number of parallel jobs passed to cross_val_score and some estimators (default: 1).
random_state (int) – Random seed used for reproducibility (default: 42).
study_name (str) – Optuna study name / storage key base (default: 'my_study').
deactivate (bool) – If True, optimization is skipped and the instance is returned as-is.
- get_best_params() -> Dict[str, Any]
Return the best hyperparameter dictionary found by the last optimize() call.
- Returns:
Best parameters dictionary.
- Return type:
Dict[str, Any]
- Raises:
AttributeError – If optimize() has not been run and best_params is not set.
- get_best_score() -> float
Return the best cross-validated score found by the last optimize() call.
- Returns:
Best cross-validated score.
- Return type:
float
- Raises:
AttributeError – If optimize() has not been run and best_score is not set.
- optimize(data: DataFrame) -> Tuple[Dict[str, Any], float] | Optimizer
Run the Optuna optimization process to find the best hyperparameters.
- Steps:
Infer task type and CV splitting strategy.
Build the list of candidate models (either user-provided or the default from _get_model_list).
Define an Optuna objective that samples model name (if multiple) and hyperparameters, sets them on the model, and evaluates via cross_val_score using the configured CV splitter.
Create or load an Optuna study (SQLite storage 'example.db') and run the specified number of trials.
Store best_params and best_score on the instance and return them.
- Parameters:
data (pd.DataFrame) – DataFrame containing feature columns and the activity/id columns.
- Returns:
(best_params, best_score) tuple on success or self if deactivated.
- Return type:
Tuple[Dict[str, Any], float] | Optimizer
- Raises:
Exception – Any unexpected exceptions are logged and re-raised.
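The shape of the loop described in the steps above can be sketched with the standard library alone. This is an illustration under loud assumptions, not the Optimizer's code: plain random sampling stands in for Optuna's trial sampler, a toy 1-D k-nearest-neighbour rule stands in for the candidate estimators, and repeated_kfold, knn_predict, and cv_accuracy are hypothetical helpers. In a real setup, Optuna's trial.suggest_int and scikit-learn's cross_val_score would fill those roles.

```python
import random
from statistics import mean


def repeated_kfold(n_samples, n_splits, n_repeats, seed):
    """Yield (train_idx, test_idx) pairs: n_splits folds, reshuffled n_repeats times."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        for fold in range(n_splits):
            test_idx = order[fold::n_splits]
            held_out = set(test_idx)
            train_idx = [i for i in order if i not in held_out]
            yield train_idx, test_idx


def knn_predict(train_x, train_y, x, k):
    """Majority vote among the k nearest 1-D training points (toy model)."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    positives = sum(train_y[i] for i in nearest)
    return 1 if 2 * positives > len(nearest) else 0


def cv_accuracy(xs, ys, k, splits):
    """Mean accuracy over all folds for one hyperparameter value."""
    fold_scores = []
    for train_idx, test_idx in splits:
        train_x = [xs[i] for i in train_idx]
        train_y = [ys[i] for i in train_idx]
        hits = sum(knn_predict(train_x, train_y, xs[i], k) == ys[i] for i in test_idx)
        fold_scores.append(hits / len(test_idx))
    return mean(fold_scores)


# Synthetic binary data: label is 1 when a noisy copy of x exceeds 0.5.
rng = random.Random(42)
xs = [rng.uniform(0, 1) for _ in range(60)]
ys = [1 if x + rng.gauss(0, 0.1) > 0.5 else 0 for x in xs]

# n_splits=5 and n_repeats=2 mirror the documented defaults.
splits = list(repeated_kfold(len(xs), n_splits=5, n_repeats=2, seed=42))

# Same shape as the documented param_ranges example: name -> (low, high).
param_range = {"k": (1, 15)}

best_params, best_score = None, float("-inf")
for _ in range(20):  # plays the role of n_trials
    params = {"k": rng.randint(*param_range["k"])}
    score = cv_accuracy(xs, ys, params["k"], splits)
    if score > best_score:
        best_params, best_score = params, score
```

With 5 splits and 2 repeats, each trial is scored on 10 folds, which is what makes the averaged score less sensitive to any single partition.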
Automation
Pipeline
Inference
Inference-focused runner that prepares inputs, calls a prediction pipeline, and writes results back in-place by default.
- The runner stores light metadata after each run:
last_input_df: full input DataFrame after prediction (deep-copied when possible)
last_preds: DataFrame or Series-like predictions captured from the pipeline
last_run_time, last_n, last_prediction_summary
- The pretty __repr__ produces a concise box showing inference statistics:
prediction mean/std/quantiles
Applicability Domain (AD) counts if present in the input frame
largest Prediction Interval (PI) range if PI lower/upper columns exist
top / bottom K predicted items (shows SMILES if available)
- param pipeline:
Object exposing the required attributes id_col, smiles_col, and activity_col, and a callable predict(df, alpha=…) that returns a DataFrame, a Series or array-like, or a mapping of prediction values.
- type pipeline:
object
- param inplace:
If True and the provided input is a pandas DataFrame, mutate it in-place. If False, a copy is used and returned. Default: True.
- type inplace:
bool
- param alpha:
Default alpha forwarded to pipeline.predict. Default: 0.05.
- type alpha:
float
- param logger:
Optional logger to use for exceptions and debug messages. If None, the module logger is used.
- type logger:
logging.Logger | None
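The prediction statistics listed for the pretty __repr__ (mean, standard deviation, quantiles, top and bottom K items) could be computed along these lines. This is a sketch with the standard library only: summarize_predictions, its dictionary keys, and the sample values are hypothetical and do not reflect the runner's actual internals or the layout of last_prediction_summary.

```python
from statistics import mean, quantiles, stdev


def summarize_predictions(preds, ids, k=2):
    """Collect summary statistics for a list of numeric predictions."""
    # Rank items from highest to lowest predicted value.
    ranked = sorted(zip(ids, preds), key=lambda pair: pair[1], reverse=True)
    q1, median, q3 = quantiles(preds, n=4)  # quartile cut points
    return {
        "n": len(preds),
        "mean": mean(preds),
        "std": stdev(preds),  # sample standard deviation
        "quantiles": (q1, median, q3),
        "top_k": ranked[:k],
        "bottom_k": ranked[-k:],
    }


# Made-up predictions and identifiers, for illustration only.
summary = summarize_predictions(
    [5.0, 6.0, 7.0, 8.0, 9.0],
    ["m1", "m2", "m3", "m4", "m5"],
)
```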