Molecular Featurization
Updated 10/12/2025
Molecular featurization is the process of converting molecular structures into numerical representations that can be used by machine learning algorithms. This fundamental step bridges the gap between chemical structures and computational models, enabling the application of machine learning to chemical problems.
Estimated reading time: 8-10 minutes
Introduction to Molecular Featurization
Choice of molecular representation is critical for successful modeling and prediction of molecular properties. In short, machine learning algorithms require numerical input, but molecules exist as complex 3D structures with unique bonding patterns, electronic distributions, dynamics, and other properties. This yields a central challenge in cheminformatics: how to convert molecular structures into meaningful numerical features (descriptors) that preserve essential chemical information.
A feature vector represents a molecule as a fixed-length array of numbers, where each position corresponds to a specific molecular property or structural characteristic (Figure 1). These feature vectors serve as inputs to machine learning models that learn correlations among features and/or between features and specified target variables. When deciding on a featurization method, it is important to consider the trade-off between information content and computational efficiency. Rapid-to-compute features like molecular fingerprints capture less detailed information about molecules than computationally expensive features, such as those derived from quantum mechanical methods like density functional theory (DFT). Consequently, models built using rapid structural descriptors may require substantially more experimental data to achieve the same predictive performance as models trained using physics-based descriptors from quantum calculations. This trade-off is particularly important in chemistry, where experimental data is often limited and conducting experiments is a bottleneck.
This summary introduces some major categories of molecular featurization methods. This is not a comprehensive or in-depth summary, but a basic introduction for those who are new to cheminformatics and molecular machine learning.
Figure 1: Representative feature vectors for triphenylphosphine.
String Representations (SMILES and more)
On a practical level, string representations of molecules allow for storing, transporting, and converting large libraries of molecular structures for a variety of tasks. The SMILES representation, developed by David Weininger in the 1980s, remains the most popular string representation for cheminformatics tasks (2,3). A set of rules encodes 2D chemical structure information into strings of letters. Multiple SMILES strings can represent the same molecule (ethanol as CCO, OCC, or C(O)C), so canonicalization algorithms were created that provide a unique SMILES representation per molecule, or canonical SMILES (4). SMARTS (SMILES ARbitrary Target Specification) extends SMILES for pattern matching, enabling functional group searches and substructure identification (5). Additional string representations include InChI (IUPAC's hierarchical identifier) (6,7) and SELFIES (SELF-referencing Embedded Strings) (8). Although string representations are mostly used for conveniently handling libraries of molecular structures, they can also serve as inputs to machine learning models via deep learning methods (9).
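Canonicalization and SMARTS matching can be demonstrated in a few lines. This is a minimal sketch assuming the open-source RDKit toolkit is installed; the SMARTS pattern `[OX2H]` for a hydroxyl group is chosen here purely for illustration.

```python
# Canonicalization and SMARTS matching -- a minimal sketch,
# assuming the open-source RDKit toolkit is installed.
from rdkit import Chem

# Three different SMILES strings for ethanol all canonicalize to the same string.
for smi in ("CCO", "OCC", "C(O)C"):
    mol = Chem.MolFromSmiles(smi)
    print(smi, "->", Chem.MolToSmiles(mol))

# SMARTS substructure search: does ethanol contain a hydroxyl group?
hydroxyl = Chem.MolFromSmarts("[OX2H]")
print(Chem.MolFromSmiles("CCO").HasSubstructMatch(hydroxyl))  # True
```

Canonicalizing before storage or deduplication is a common first step in cheminformatics pipelines, since string equality then implies molecular identity.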
Molecular Fingerprints
Extended Connectivity Fingerprints (ECFPs) (10), MAP4 (11), and other molecular fingerprints are a rapid way to compute binary feature vectors (vectors of 0s and 1s) that represent a molecule's 2D structure. Here, we will briefly summarize the ECFP as an illustrative example.
Figure 2: Extended Connectivity Fingerprint (ECFP) illustration
In 1965, Harry Morgan introduced an algorithm for machine-readable molecular representations (12), and in 2010 this was extended by Rogers and Hahn as the ECFP (10). At a high level, the algorithm systematically examines circular neighborhoods of variable radii around each atom. Atoms are first assigned identifiers based on element type, formal charge, and connectivity. Then, the algorithm iteratively expands its view around each atom. After each expansion, the data is hashed, meaning the atomic environment information is converted into numerical codes. These codes are then compressed into a standardized format - a fixed-length string of bits (typically 1024 or 2048 positions, each being either 0 or 1).
ECFP6 refers to examining environments with a diameter of 6 bonds (radius of 3 bonds), meaning it captures all structural features within 3 bonds of each central atom. In summary, fingerprints like ECFP provide a rapid way to convert 2D molecular structures into standardized numerical representations, or feature vectors, that can be used for tasks ranging from molecular similarity calculations to building predictive models. Limitations of these representations are that they are not readily interpretable and that they are high-dimensional, which means they often require relatively large amounts of data to be useful for predictive modeling. However, the speed and simplicity with which they can be computed makes them a convenient representation for many machine learning tasks and molecular similarity calculations.
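The expand-and-hash idea can be illustrated with a self-contained toy version in Python. The labelled-graph encoding, the MD5 hash, and the 64-bit length below are simplifications for illustration and are not RDKit's actual ECFP implementation; the Tanimoto function shows how such bit vectors feed into similarity calculations.

```python
import hashlib

def toy_circular_fingerprint(adjacency, atom_labels, radius=2, n_bits=64):
    """Hash each atom's circular environment (out to `radius` bonds)
    into a fixed-length bit vector, in the spirit of ECFP."""
    bits = [0] * n_bits
    # initial atom identifiers: just the element symbol
    ids = {atom: atom_labels[atom] for atom in adjacency}
    for _ in range(radius + 1):
        for ident in ids.values():
            h = int(hashlib.md5(ident.encode()).hexdigest(), 16)
            bits[h % n_bits] = 1  # set the bit this environment hashes to
        # expand each identifier with its (sorted) neighbours' identifiers
        ids = {a: ids[a] + "".join(sorted(ids[n] for n in adjacency[a]))
               for a in adjacency}
    return bits

def tanimoto(a, b):
    """Tanimoto similarity: shared on-bits over total on-bits."""
    shared = sum(x & y for x, y in zip(a, b))
    total = sum(x | y for x, y in zip(a, b))
    return shared / total if total else 0.0

# ethanol (C-C-O) and methanol (C-O) as labelled graphs
ethanol_fp = toy_circular_fingerprint({0: [1], 1: [0, 2], 2: [1]},
                                      {0: "C", 1: "C", 2: "O"})
methanol_fp = toy_circular_fingerprint({0: [1], 1: [0]},
                                       {0: "C", 1: "O"})
similarity = tanimoto(ethanol_fp, methanol_fp)
```

Real implementations use richer initial atom identifiers and much longer bit vectors, but the core loop - assign, expand, hash, set bits - is the same.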
2D Descriptors
2D molecular descriptors capture structural and physicochemical properties calculable directly from 2D structure without requiring 3D coordinates or conformational information. One way to compute a set of 2D descriptors is RDKit's implementation (15), which provides over 200 descriptors spanning properties like molecular weight, atom counts, connectivity indices, Zagreb indices, partial charges, and polarizability estimates (Figure 3).
Figure 3. Representative RDKit 2D feature vector.
Although these descriptors can be more interpretable than fingerprints, they are still limited by the fact that they are obtained from 2D molecular structure and thus may not capture complex molecular property or reactivity trends.
Physics-based Descriptors from DFT or Other Calculations
Physics-based descriptors derived from density functional theory (DFT) or other computational chemistry calculations are more information-rich than the featurization methods described above (16). These descriptors rely on 3D representations of molecular structure, and even different conformations of each molecule. They can include electronic properties such as orbital energies and atomic charges, as well as geometric properties representing spatial arrangements of atoms.
Figure 4. Representative triphenylphosphine physics-based feature vector obtained from DFT-level electronic and geometric calculations of 3D molecular structure.
Due to their more information-rich representation of molecules, these features generally enable predictive modeling with less data. They also yield models that are substantially more interpretable, because the features are molecular properties that chemists are familiar with and often have an intuition for. However, they come at a significantly higher computational cost, both in terms of calculation time and the additional expertise required in computational chemistry. Although modern software packages have made calculations more accessible to experimental chemists, building machine learning features this way requires running calculations on tens to thousands of molecular structures and then extracting features from the output files, an additional barrier that makes such calculations a major undertaking. Processes involved in obtaining physics-based descriptors will be covered in another section.
Graph and Learned Representations
Molecular graph representations encode chemical structures as mathematical graphs where atoms serve as nodes and bonds as edges, with node and edge features capturing atomic and bond properties, respectively. Graph neural networks can accept molecular graphs and learn a molecular representation that can be correlated to target variables for regression, classification, or unsupervised tasks (17). Other learned representations can be obtained via molecular strings, 3D coordinates, and other input formats, where the machine learning model learns a molecular representation that is productive for predictive modeling (9). These approaches can be useful because they bypass the need for the chemist to hand-select specific features as in the featurization methods described above. However, because they usually rely on deep learning methods, they are data-hungry, which limits their application to larger datasets that may not be realistic in many chemical research scenarios. Additionally, their interpretability can be limited because learned representations are not directly meaningful to a chemist.
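The node/edge encoding can be sketched in plain Python. The three-element vocabulary and the single, weight-free aggregation step below are illustrative assumptions rather than a real graph neural network, which would interleave learned weight matrices and nonlinearities with many such steps.

```python
ELEMENTS = ["C", "O", "N"]  # tiny element vocabulary for illustration

def mol_graph(atom_labels, bonds, n_atoms):
    """Encode a molecule as one-hot node features plus an adjacency matrix."""
    node_feats = [[1 if el == ELEMENTS[i] else 0 for i in range(len(ELEMENTS))]
                  for el in atom_labels]
    adj = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        adj[i][j] = adj[j][i] = 1  # bonds are undirected edges
    return node_feats, adj

def aggregate(node_feats, adj):
    """One message-passing round: sum each atom's features with its neighbors'."""
    n = len(adj)
    return [[sum(node_feats[j][k] for j in range(n) if adj[i][j] or i == j)
             for k in range(len(node_feats[0]))]
            for i in range(n)]

# ethanol: atoms C, C, O with bonds C0-C1 and C1-O2
feats, adj = mol_graph(["C", "C", "O"], [(0, 1), (1, 2)], 3)
updated = aggregate(feats, adj)  # each row now summarizes a 1-bond neighborhood
```

Stacking several such rounds lets information propagate across the whole graph, which is how graph neural networks build up a molecule-level representation.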
Conclusion
Diverse molecular featurization methods offer unique advantages for specific applications in computational chemistry and drug discovery. Traditional methods like SMILES strings, ECFP fingerprints, and 2D descriptors provide interpretable, computationally efficient representations with decades of validation, while physics-based DFT descriptors offer mechanistic insights crucial for understanding reactivity patterns. Modern approaches including graph neural networks and learned representations excel at capturing complex non-linear relationships and adapting to specific tasks, though often requiring larger datasets and computational resources.
Successful molecular modeling increasingly relies on understanding the strengths and limitations of different representation methods rather than seeking universal solutions. The choice between fingerprints for virtual screening, quantum descriptors for mechanistic insight, or learned representations should align with specific research objectives, available data, and interpretability requirements. As the field evolves toward hybrid approaches combining traditional and modern methods, organic chemists equipped with knowledge of this diverse toolkit are better positioned to tackle computational challenges in drug discovery, materials science, and chemical biology. The continued development of accessible software tools and pre-trained models promises to democratize these powerful techniques, enabling broader adoption across the chemistry research community.
References
1. D. S. Wigh, J. M. Goodman, A. A. Lapkin. A review of molecular representation in the age of machine learning. Wires Comput. Mol. Sci. 2022, 12, e1603: https://doi.org/10.1002/wcms.1603.
2. D. Weininger. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 1988, 28, 31–36: https://pubs.acs.org/doi/10.1021/ci00057a005.
3. SMILES documentation: https://www.daylight.com/dayhtml_tutorials/languages/smiles/index.html#INTRO
4. N. M. O'Boyle. Towards a universal SMILES representation - A standard method to generate canonical SMILES based on the InChI. J. Cheminf. 2012, 4, 22: https://jcheminf.biomedcentral.com/articles/10.1186/1758-2946-4-22.
5. SMARTS documentation: https://www.daylight.com/dayhtml/doc/theory/theory.smarts.html.
6. S. E. Stein, S. R. Heller, D. Tchekhovskoi. An Open Standard for Chemical Structure Representation - The IUPAC Chemical Identifier, 2003 Nimes International Chemical Information Conference Proceedings, 131–143: https://old.iupac.org/inchi/Stein-2003-ref1.html
7. IUPAC InChI: https://iupac.org/who-we-are/divisions/division-details/inchi/.
8. M. Krenn, F. Häse, A. Nigam, P. Friederich, A. Aspuru-Guzik. Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation. Mach. Learn.: Sci. Technol. 2020, 1, 045024. DOI: 10.1088/2632-2153/aba947.
9. S. Wang, Y. Guo, Y. Wang, H. Sun, J. Huang. SMILES-BERT: Large scale unsupervised pre-training for molecular property prediction. ACM-BCB, 2019. https://doi.org/10.1145/3307339.3342186.
10. D. Rogers, M. Hahn. Extended-connectivity fingerprints. J. Chem. Inf. Model. 2010, 50, 742–754: https://doi.org/10.1021/ci100050t.
11. A. Capecchi, D. Probst, J. -L. Reymond. One molecular fingerprint to rule them all: drugs, biomolecules, and the metabolome. J. Cheminform. 2020, 12, 43: https://doi.org/10.1186/s13321-020-00445-4.
12. H. L. Morgan. The generation of a unique machine description for chemical structures-A technique developed at chemical abstracts service. J. Chem. Doc. 1965, 5, 107–113: https://doi.org/10.1021/c160017a018.
13. ChemicBook blog post: https://chemicbook.com/2021/03/25/a-beginners-guide-for-understanding-extended-connectivity-fingerprints.html
14. Phyo Phyo Kyaw Zin blog post: https://drzinph.com/ecfp6-fingerprints-in-python-part-3/
15. RDKit descriptor documentation: https://www.rdkit.org/docs/source/rdkit.Chem.Descriptors.html
16. M. Bursch, J. -M. Mewes, A. Hansen, S. Grimme. Best-practice DFT protocols for basic molecular computational chemistry. Angew. Chem. Int. Ed. 2022, 61, e202205735: https://doi.org/10.1002/anie.202205735.
17. P. Reiser, M. Neubert, A. Eberhard, L. Torresi, C. Zhou, C. Shao, H. Metni, C. van Hoesel, H. Schopmans, T. Sommer, P. Friederich. Graph neural networks for materials science and chemistry. Commun. Mater. 2022, 3, 93.