Efficient interpretable prediction of protein-ligand interactions using gradient boosting models and explainable AI
Abstract
Predicting the binding affinity of small molecules to protein targets is a critical step in modern drug discovery, offering the potential to accelerate the identification of effective therapeutics while reducing experimental costs. In this study, we use the BELKA dataset, a large-scale DNA-encoded chemical library (DEL), to train machine learning models for binding prediction. Using XGBoost, a tree-based gradient boosting algorithm, together with extensive preprocessing and feature engineering, we develop models for three protein targets (BRD4, HSA, and sEH) that predict whether a given small molecule binds to each target. The models demonstrate strong predictive capabilities, and SHAP analysis provides interpretability by identifying the molecular features that drive binding predictions. Evaluation on the BELKA test dataset reveals challenges in generalization, providing valuable insights into the complexities of predictive modelling in drug discovery. This work highlights the promise of machine learning in advancing computational drug discovery by enabling efficient exploration of chemical space for potential therapeutics.