
Predictive Modeling

Updated 10/12/2025

Build predictive models to forecast molecular properties and chemical outcomes. Upload training data, select features, train models using various algorithms, and make predictions on new molecules with confidence.

Table of Contents (Estimated reading time: 10-12 minutes)

  1. Workflow setup
  2. Upload training data
  3. Select target variable
  4. Select features
  5. Visualizing correlations
  6. Feature correlations
  7. Permutation importances
  8. Model training
  9. Training and visualizing the model
  10. Model inference

Workflow setup

You can start by saving a new workflow, or you will be prompted automatically to save one after uploading a training file. Pressing "load workflow" opens a previously saved workflow and its associated data.

Workflow Setup Interface

Upload training data

Press "upload training data" and upload a CSV containing features (descriptors, X-variables) and the target (Y-variable) that you want the model to predict (see the sample layout below). You can optionally include SMILES strings of the molecules or other identifiers for organization, but they will not be used directly for model training. The framework accepts integers and floats and will disregard string data types. If you want to reuse a training file that you have uploaded previously, press "load training file".
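
For reference, here is a minimal sketch of the expected layout. The descriptor columns, values, and target name ("ddG") are hypothetical placeholders, and the SMILES column is optional:

```csv
SMILES,buried_volume,HOMO_energy,dipole,ddG
CC(C)P(c1ccccc1)c1ccccc1,45.2,-5.61,1.42,1.8
CP(C)c1ccccc1,38.7,-5.48,1.15,0.9
CC(C)(C)P(C(C)(C)C)C(C)(C)C,51.3,-5.02,0.87,2.4
```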

Upload Training Data Interface

Select target variable

After uploading the training file, you will be prompted to select which column is the target variable. In this case, we are building a model to predict the ΔΔG‡ that discriminates two enantiodivergent transition states in a Suzuki-Miyaura cross-coupling reaction.

Target Variable Selection Interface

Select features

Next, you will be prompted to select the features you want to consider for the model. Unless you know you want to exclude some features, you can press "select all" and then deselect any columns that are not features, such as SMILES strings or molecule names.

Feature Selection Interface

The table of data you have uploaded can always be visualized by pressing the "Table" tab.

You can change the selected features at any time by pressing "modify selected features".

Feature Selection Interface

Visualizing correlations

After selecting variables, you can press "Run Correlation Analysis" to get an idea of the correlation between each individual feature and the target variable (left figure). The table on the right contains the top 10 Spearman correlations.
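
Under the hood, this step amounts to ranking features by their Spearman correlation with the target. Here is a minimal pandas sketch of the same idea, not the app's actual implementation; the file name and target column "ddG" are hypothetical:

```python
import pandas as pd

df = pd.read_csv("training_data.csv")   # hypothetical file name
target = "ddG"                          # hypothetical target column

# Keep only numeric feature columns (string columns are ignored, as in the app).
features = df.drop(columns=[target]).select_dtypes("number")

# Spearman rank correlation of each feature with the target.
corr = features.corrwith(df[target], method="spearman")

# Top 10 features by absolute correlation, as in the results table.
print(corr.abs().sort_values(ascending=False).head(10))
```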

Data Table Visualization

Feature correlations

Clicking on a feature in the correlation table on the right will reveal other features that it is correlated with. If desired, correlated features can be removed at this point by unchecking them; this can reduce the chance of overfitting and yield better models. The relationship between features is often neither linear nor monotonic, in which case feature correlation is less helpful and permutation importance (fitting a model to the data) provides further insight.
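
If you prefer to inspect this outside the app, here is a minimal sketch that flags highly inter-correlated feature pairs, continuing from the previous sketch; the 0.9 cutoff is an arbitrary assumption:

```python
# Pairwise feature-feature correlations (absolute Spearman).
corr_matrix = features.corr(method="spearman").abs()

threshold = 0.9  # arbitrary cutoff; tune to your data
cols = corr_matrix.columns
for i in range(len(cols)):
    for j in range(i + 1, len(cols)):
        if corr_matrix.iloc[i, j] > threshold:
            # One feature of each highly correlated pair is a candidate to drop.
            print(f"{cols[i]} ~ {cols[j]}: {corr_matrix.iloc[i, j]:.2f}")
```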

Correlation Analysis Interface

Permutation importances

Feature permutation importance measures how much a model's performance decreases when a single feature's values are randomly shuffled, breaking its relationship with the target variable. A large performance drop indicates the feature is important to the model, while little or no change suggests the feature contributes minimally to predictions. To compute these, click the "permutation importance" tab, choose the model type for which you would like to compute importances, and then press "run importance analysis".
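
The app computes these for you, but for intuition, here is a minimal scikit-learn sketch of the same technique, assuming a numeric feature DataFrame X and target y from the uploaded training data:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Shuffle each feature 10 times and measure the drop in held-out R^2.
result = permutation_importance(model, X_test, y_test,
                                n_repeats=10, random_state=0)
for name, mean, std in zip(X.columns, result.importances_mean,
                           result.importances_std):
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```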

Feature Correlations Interface

Model training

To start building your model, scroll down to the model training section. You will need to fill out the choices in the model parameters box; click the information icons for details on each option. Below the figure, we describe each selection you must make.

Permutation Importance Interface

Feature selection: Choosing "manual" will use all the features you selected in the first section. Choosing "auto tune" will use sequential feature selection with a nested cross-validation protocol to find the feature combination that gives the best model for the number of features you specify. Setting the feature selection number to "auto" will instead use recursive feature elimination to find both the optimal number of features and the most productive feature combination.

Notes on feature selection: Having fewer features usually reduces the chance of overfitting and gives a more general model. In chemistry, we are often data scarce, so avoiding overfitting can be a challenge. A rough rule of thumb is about 1 feature per 10 training data points, but this is context and model dependent, so you will need to explore and iterate until you get the best model.
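
For intuition, the two automatic modes correspond roughly to the following scikit-learn tools. This is a sketch under that assumption, not the app's exact protocol; X, y, and the feature count of 5 are placeholders:

```python
from sklearn.feature_selection import RFECV, SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# "auto tune" with a fixed feature count: sequential forward selection.
sfs = SequentialFeatureSelector(LinearRegression(),
                                n_features_to_select=5, cv=5)
sfs.fit(X, y)
print("Selected:", list(X.columns[sfs.get_support()]))

# Feature number set to "auto": recursive feature elimination with CV
# picks both how many features to keep and which combination.
rfecv = RFECV(LinearRegression(), cv=5).fit(X, y)
print("Optimal count:", rfecv.n_features_)
print("Selected:", list(X.columns[rfecv.get_support()]))
```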

Hyperparameters: Model hyperparameters are configuration settings that control the learning process and model structure; they are set before training begins rather than learned from the data. Examples include learning rate, regularization strength, tree depth, and number of hidden layers. These parameters affect how the model learns but aren't directly optimized during training. Selecting "default" will use standard model values, while selecting "auto tune" will perform a hyperparameter grid search with nested cross-validation to optimize the model for your application. Note: linear regression has no hyperparameters, so always leave this set to "default".
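
For intuition, here is a minimal sketch of a hyperparameter grid search wrapped in nested cross-validation, scikit-learn style; the random forest grid shown is an illustrative assumption, not the app's actual search space:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score

param_grid = {"n_estimators": [100, 300], "max_depth": [3, 5, None]}

# Inner loop tunes hyperparameters; outer loop estimates generalization.
inner = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=5)
outer_scores = cross_val_score(inner, X, y, cv=5)
print(f"Nested CV R^2: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```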

Metric to optimize: The optimization metric (or loss function) defines what "good performance" means during training, guiding how the model updates its parameters to improve.

K-Fold Cross Validation: The value of K, which determines the number of folds created during training and testing of the model.
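
For intuition, here is a minimal sketch of how the chosen metric and K come together in K-fold cross-validation, assuming scikit-learn conventions; mean absolute error is just one example metric:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

K = 5  # number of folds
scores = cross_val_score(LinearRegression(), X, y, cv=K,
                         scoring="neg_mean_absolute_error")
print("MAE per fold:", -scores)  # scikit-learn negates error metrics
```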

Model: Generally, it is important to start simple and move toward increasing complexity, which reduces the chance of overfitting. For this reason, linear regression, the simplest of the models offered, should almost always be tested first. The available models are described below, with a quick comparison sketch after the list.

  • Linear Regression: Fits a straight line through data by finding the linear combination of features that best predicts the target, assuming a linear relationship between inputs and output.
  • Elastic Net Regression: Combines L1 (Lasso) and L2 (Ridge) regularization penalties to perform feature selection while handling correlated features, shrinking coefficients toward zero.
  • Bayesian Ridge Regression: A probabilistic approach to linear regression that uses Bayesian inference to automatically determine regularization strength and provide uncertainty estimates for predictions.
  • Random Forest: Creates multiple decision trees trained on random subsets of data and features, then averages their predictions.
  • XGBoost: An efficient gradient boosting algorithm that builds an ensemble of decision trees sequentially, where each tree corrects errors from previous trees.
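
For readers who want to reproduce such a comparison outside the app, here is a minimal sketch instantiating open-source counterparts of the five models; X and y are assumed from earlier sketches, and XGBoost requires the separate xgboost package:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import BayesianRidge, ElasticNet, LinearRegression
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor  # separate xgboost package

models = {
    "Linear Regression": LinearRegression(),
    "Elastic Net": ElasticNet(),
    "Bayesian Ridge": BayesianRidge(),
    "Random Forest": RandomForestRegressor(random_state=0),
    "XGBoost": XGBRegressor(random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: R^2 = {scores.mean():.3f}")
```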

Model Training Interface

Training and visualizing the model

After making the selections above, press "train model". The time it takes for this to complete can vary depending on the models and parameters chosen above.

Model prediction: This shows a parity plot, in which actual values are plotted against the model's predictions.
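
For intuition, here is a minimal matplotlib sketch of a parity plot, assuming NumPy arrays y_true and y_pred of actual and predicted values:

```python
import matplotlib.pyplot as plt

plt.scatter(y_true, y_pred)
lims = [min(y_true.min(), y_pred.min()), max(y_true.max(), y_pred.max())]
plt.plot(lims, lims, "k--")  # y = x: points on this line are perfect predictions
plt.xlabel("Actual")
plt.ylabel("Predicted")
plt.show()
```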

Cross validation: Clicking "cross validation" shows the performance of the model across many different train/test splits as a box plot.

Model statistics: This section reports performance on the train and test splits for judging the model.

Model features: This section shows the features used in the final model.

Model hyperparameters: This section shows the hyperparameter values used in the final model.

Opening previously trained models: Click the box in the top right corner that contains an identifier code for your model and the type of model you ran. This opens a list of all the models generated in this workflow; clicking a model will open it in its trained state.

Model Selection Interface

Model inference

This is where you use the model you built to make predictions on untested molecules, or to further validate it against completely unseen data. Press "new inferencing data" and upload a CSV with the same structure as your training CSV, but without the target variable (this is what we are going to predict). You can also load a previously uploaded inference file by pressing "load inference file".

Pressing "run inferencing" will make the predictions. You can visualize the results by pressing "modeled molecules" or "predicted values", and by observing the histogram illustrating the distribution of the predictions. Pressing "download predictions" will provide a CSV with all the molecules, features, and their predicted target values.
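
Outside the app, the equivalent step is a short script once a model is trained. Here is a minimal sketch, with "new_molecules.csv" and the column name "predicted_ddG" as hypothetical placeholders:

```python
import pandas as pd

new_data = pd.read_csv("new_molecules.csv")      # hypothetical file name
feature_cols = list(X.columns)                   # features used during training

new_data["predicted_ddG"] = model.predict(new_data[feature_cols])
new_data.to_csv("predictions.csv", index=False)  # molecules, features, predictions
```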

Model Training Results Interface