Efficient interpretable prediction of protein-ligand interactions using gradient boosting models and explainable AI

Truong Thi Cam Mai

doi:10.52111/qnjs.2025.19110

Authors: Truong Thi Cam Mai

Journal: Quy Nhon University Journal of Science

Published: 2025/02/28

Volume/Issue: Vol. 19, Issue 1

Pages: 115-130

DOI: https://doi.org/10.52111/qnjs.2025.19110

Abstract

The prediction of small-molecule binding affinity to protein targets is a critical step in modern drug discovery, offering the potential to accelerate the identification of effective therapeutics while reducing experimental costs. In this study, we employ the BELKA dataset, a large-scale DNA-encoded chemical library (DEL) dataset, to train machine learning models for binding affinity prediction. Using XGBoost, a tree-based gradient boosting algorithm, together with extensive preprocessing and feature engineering, we develop predictive models for three protein targets (BRD4, HSA, and sEH) that classify whether a given small molecule binds one of these targets. The models demonstrate strong predictive capabilities, with interpretability achieved through SHAP analysis to identify the molecular features driving binding predictions. Evaluation on the BELKA test dataset reveals challenges in generalization, providing valuable insights into the complexities of predictive modelling in drug discovery. This work highlights the promise of machine learning in advancing computational drug discovery by enabling efficient exploration of the chemical space for potential therapeutics.
