This repository contains code and data for a QSAR (Quantitative Structure-Activity Relationship) drug discovery project. The goal is to develop predictive models that estimate the biological activity of chemical compounds from their molecular structure.

The repository covers:

- Data preprocessing and feature extraction
- Machine learning model training and evaluation
- Random Forest Regressor implementation
- Support for various molecular descriptors
- Visualization of results
- Hyperparameter tuning
- Cross-validation
- Documentation and examples
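
As a rough illustration of how the modelling items above fit together, here is a minimal sketch of a Random Forest regressor with cross-validation and a small hyperparameter grid. It runs on synthetic data from `make_regression` rather than on real molecular descriptors. The helper name `tune_random_forest` and the grid values are illustrative assumptions, not the settings used in this project.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score


def tune_random_forest(X, y, cv=3, random_state=0):
    """Grid-search a small Random Forest and return the fitted search object.

    Hypothetical helper: the grid below is a placeholder, not a tuned choice.
    """
    param_grid = {
        "n_estimators": [25, 50],
        "max_depth": [None, 5],
    }
    search = GridSearchCV(
        RandomForestRegressor(random_state=random_state),
        param_grid,
        cv=cv,
        scoring="r2",
    )
    search.fit(X, y)
    return search


# Synthetic stand-in for a descriptor matrix (rows = compounds) and a target.
X, y = make_regression(n_samples=120, n_features=8, noise=0.1, random_state=0)

search = tune_random_forest(X, y)
best_model = search.best_estimator_

# Cross-validated R^2 of the selected configuration.
scores = cross_val_score(best_model, X, y, cv=3, scoring="r2")
print("best params:", search.best_params_)
print("mean CV R^2:", round(float(np.mean(scores)), 3))
```

Scoring the selected model on the same data used for the grid search gives an optimistic estimate. A held-out test set or nested cross-validation would be a fairer check.
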

Requirements:

- Python 3.7+
- pandas
- scikit-learn
- numpy
- matplotlib
- seaborn
- RDKit (for cheminformatics tasks)
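
To check that the packages listed above are importable in the current environment, something like the following sketch may help. The helper `missing_modules` is a hypothetical name introduced here. The module names are the usual import names for these packages (scikit-learn is imported as `sklearn`).

```python
import importlib.util
import sys

# Import names for the packages listed above.
REQUIRED_MODULES = ["pandas", "sklearn", "numpy", "matplotlib", "seaborn", "rdkit"]


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if sys.version_info < (3, 7):
    print("Python 3.7 or newer is required")

missing = missing_modules(REQUIRED_MODULES)
print("missing:", missing if missing else "none")
```
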

The dataset used in this project is a collection of psychoactive compounds in CSV format, available from Kaggle: https://www.kaggle.com/datasets/thedevastator/psychedelic-drug-database
The file was renamed to QSARDrugAnalysis.csv before import.
This example uses SlogP as the target variable, but in principle any other variable can be used. In general, biological activity data (e.g., IC50, EC50, or energies in kcal/mol) serves as the target variable. My research is not yet published, so I cannot share the actual dataset I am using. The dataset should contain molecular structures (e.g., SMILES strings) and their corresponding biological activity values.
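
The expected table layout can be sketched as follows: one structure column, some numeric descriptor columns, and a target column. This is a minimal sketch under assumptions. Only the target name `SlogP` comes from this README. The column names `SMILES`, `desc_a`, and `desc_b`, the helper `split_features_target`, and all numeric values are placeholders, not real measurements or the actual file schema.

```python
import io

import pandas as pd


def split_features_target(df, target="SlogP", smiles_col="SMILES"):
    """Split a descriptor table into numeric features X and target y.

    Rows with a missing target are dropped. The column name `smiles_col`
    is an assumption about the file layout.
    """
    df = df.dropna(subset=[target])
    y = df[target]
    # Keep only numeric columns as features, excluding the target itself.
    X = df.drop(columns=[target, smiles_col], errors="ignore").select_dtypes("number")
    return X, y


# Tiny in-memory stand-in for QSARDrugAnalysis.csv; values are placeholders.
demo_csv = io.StringIO(
    "SMILES,desc_a,desc_b,SlogP\n"
    "CCO,1.0,2.0,0.1\n"
    "c1ccccc1,3.0,4.0,1.5\n"
    "CC(=O)O,5.0,6.0,\n"
)
df = pd.read_csv(demo_csv)
X, y = split_features_target(df)
print(X.shape, y.shape)
```

With the real file, the in-memory buffer would be replaced by `pd.read_csv("QSARDrugAnalysis.csv")`, and `target` by whichever activity column is being modelled.
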