Structured output learning method for multilabel molecular activity classification problem.
View the Project on GitHub hongyusu/Molecular_Classification
Structured output learning method for multilabel molecular activity classification problem.
When use the processed data, please cite the following papers:
Su, Hongyu; Rousu, Juho. Multilabel Classification through Random Graph Ensembles. Machine Learning, DOI: 10.1007/s10994-014-5465-9.
Please refer to ./data/README.md for more information about molecular activity data.
Take a look at (http://dtp.cancer.gov/index.html) to get an overview of the drug screen project.
Either process the data with your own scripts or follow the scripts and instruction.
The goad here is to better understand the structure of the data source and to validate the processed data.
Go to (http://www.ncbi.nlm.nih.gov/pcassay) and search with keywords 'DTP/NCI'.
Download the list of cellline AIDs that correspond to the celllines in DTP/NCI project and save AIDs to the file:../DTPNCI2015/otherfiles/pcassay.
For each cell line, download its data file from (ftp://ftp.ncbi.nlm.nih.gov/pubchem/Bioassay/CSV/Data/0000001_0001000.zip) according to its AID. This will be a zip file covers all cell line data files in DTP/NCI project.
So far, we have around 40,000 molecules with their activities in about 70 cell lines. There exist two problems with this data matrix in the file:./activity.txt which we should tackle.
The first problem is that we still need structural information for molecules, based on which we can measure the similaries with e.g. kernel methods.
The structural information of molecules are in the folder:./molecular_structures, where there are two types of structural information for molecules, namely, SDF files and SMI strings.
Note that we may not have a complete set of structures. Therefore, we need to filter the activity matrix against the set of molecular structures available in the folder.
The second problem is that the activity score matrix in the file:activity_complete is not complete. In other words, there area many entries with missing value. This is problemetic during learning in the later phase.
To tackle the second problem, we will organize the score matrix such that the complete information will appear in the upper left block of the whole matrix.
We are given a set of 60 cancer cell lines in the file:otherfiles/id2aid2names. These cell lines are the ones with additional information e.g. in cell miner (http://discover.nci.nih.gov/cellminer/home.do).
First, we organize the score matrix such that the first 60 columns will correspond to these 60 cell lines, and the rest will follow. In particular, we have column structure as 60 + 13.
Then we organize the rows of the score matrix such that the first ~4000 rows will correspond to the molecules that have complete data in these 60 cell lines. The rest will follow.
We notice that the activity score is computed according -10log(GI50). We still need binary value as the activity outcome for classification task. According to NCBI, a molecule is 'active' if the activity score is over(*=) 60 and 'inactive' otherwise.
To measure the similary between pair of molecules as well as to enable kernel based learning algorithms, we need to compute a m by m kernel matrix.
Kernel functions that we will be using in this project include Fingerprint-Tanimoto kernel and many graph kernels.
Figureprint-Tanimoto kernels:
The first step is to generate for each molecule a fingerprint vector.
The python script compute_fingerprint_parallel.py will compute different type of molecular fingerprints in parallel in an interactive cluster (e.g. UKKO cluster of CS Department in University of Helsinki).
There are serveral type of molecular fingerprints that can be generated by open source softwares. Here we only consider fingerprint of type fp2, fp3, and fp4 generated by Openbabel (http://openbabel.org/wiki/Tutorial:Fingerprints).
The fp2 fingerprints we use are linear fragments upto 7 atoms. We have studied fp2 fingerprint in the previous work. Fp3 and fp4 fingerprints can also be studied based on the same principle.
Read more about fingerprint in Openbabel user manual.
Given a molecular structure in SDF file, this fingerprint can be generated by OpenBabel (http://openbabel.org/wiki/Main_Page), see the link for more information.
We need to generate the fingerprints of around 40,000 molecules. It is important to parallelize the computation in a computer cluster e.g. Triton in Aalto, Ukko in University of Helsinki.
The scripts for fingerprint computation and parallelization can be found from the project folder.
The next step is to compute the Tanimoto kernel based on the fingerprints.
The perl script compute_tanimoto_kernel.pl will compute the elements in the kernel matrix. The perl script will make the pairwise comparison faster.
Tanimoto kernel function is defined on two binary bit vectors. See https://en.wikipedia.org/wiki/Jaccard_index for more information.
As the boolean operations (and, or) on two long bit vectors are not cheap. Perl or Bash programmings are recommended which allow fast processing on strings. In particular, Perl scripts can be found from project folder.
Notice that we do not need the kernel for all 40,000 molecule. We only need the kernel for the roughly 4,000 molecules which have complete activity over 60 cancer cell lines.
In the end, we only compute the kernel matrix for 5,000 molecules which will cover all molecules with complete activity over 60 cancer cell lines.
Graph kernels:
We will use open source package (http://www.bsse.ethz.ch/mlcb/research/machine-learning/graph-kernels.html) to compute graph kernels.
The first step is to convert each molecule into a adjacency matrix.
The python script compute_fingerprint_parallel.py will run in parallel to compute the adjacency of each molecule, and save the result as a matlab data file.
The R script run_sdf_to_matlab.r will take in a sdf file of a molecule and output an adjacency matrix in .mat file.
The Matlab script run_R_to_matlab.m will take in the adjacency matrix in .mat file and convert it to a data structure in matlab for the graph kernel package.
With all molecules in adjacency matrices in .mat file, we can compute many kinds of graph kenrels.
The Matlab script compute_graph_kernels.m will take in a parameter of the kernel type and compute the corresponding kernel matrix over 5000 molecules.
The Python script compute_graph_kernels.py will enable the kernel computation in parallel in an interactive cluster
We will compute all graph kernels available in the graph kernel code package. This will cover kernels on labeled and unlabeled graph.
Notations
m the number of examples (molecular structures)
k the number of labels (cancer cell lines)
d the number of features (fingerprints)
List of files generated from preprocessing:
./DTPNCI2014/activity_complete all molecules with all activities in all cell lines
./DTPNCI2014/activity_complete_structurefilter all molecules with structural information with all activities in all cell lines
./DTPNCI2014/activity_processed arrange molecules and activities
./DTPNCI2014/target_processed activity outcomes in the same order with activity_processed
./DTPNCI2015/results/ncicancer_activities m by k matrix of activity scores
./DTPNCI2015/results/ncicancer_aids 1 by k matrix of cell line AIDs
./DTPNCI2015/results/ncicancer_features m by d matrix of figureprint features for each molecules
./DTPNCI2015/results/ncicancer_labels m by 1 matrix of molecule NSC IDs
./DTPNCI2015/results/ncicancer_targets m by k matrix of activity outcomes
./DTPNCI2015/results/ncicancer_kernel m by m kernel matrix with fp2 fingerprint feature, m=5000
./DTPNCI2015/results/ncicancer_kernel_fpk m by m kernel matrix of other two fingerprint kernels, m=5000
./DTPNCI2015/results/ncicancer_kernel_graph_* m by m graph kernel matrix, m=5000
The kernel file is not in the GitHub due to the size limit.
These are inputs to the kernel based learning algorithms.
All kernel matrices are normalized such that values on diagonal are all one.
It is recommanded to center the kernel matrix.
List of other files during preprocessing (e.g. structure files, fingerprint files):
./structures/FPfile_list list of generated fp2 fingerprint files
./structures/FPfile_list_fp3 list of generated fp3 fingerprint files
./structures/FPfile_list_fp4 list of generated fp4 fingerprint files
./structures/FPfiles/ all fp2, fp3, and fp4 fingerprint files
./structures/MATLABfile_list list of .mat files (adjacency matrix for each molecule)
./structures/MATLABfiles/ all .mat files
./structures/sdffile_list list of .sdf structure files
./structures/sdffiles/ all .sdf structure file
List of preprocessing scripts, in the order of preprocessing:
Update activity files of NCI cancer cell lines
./preprocessing_codes/update_ncicancer_actfiles.py
Select molecules and arrange the activity score matrix
./preprocessing_codes/update_ncicancer_selections.py ./preprocessing_codes/update_ncicancer_selections.r
Compute for each molecule a fingerprint vector. The code is implemented for fp2 fingerprint.
You need to modify the code in order to computer fp3 and fp3 features.
The code is also used to generate for each molecule an adjacency matrix as .mat file which is used in graph kernel computation.
./preprocessing_codes/compute_fingerprint_parallel.py ./preprocessing_codes/run_R_to_mat.m ./preprocessing_codes/run_sdf_to_mat.r
Based on the computed fingerprints, construct Fingerprint-Tanimoto kernel. You need to modify the code in order to compute the Fingerprint-Tanimoto kernel for fp3 and fp4 features.
./preprocessing_codes/compute_fp_feature_matrix.py ./preprocessing_codes/compute_tanimoto_kernel.pl ./preprocessing_codes/complete_tanimoto_kernel_matrix.r
Based on generated adjacency matrces of molecules, compute different kinds of graph kernels.
Graph kernel computation is based on a Matlab toolbox. The Python script enables the parallel computation in an interactive computer cluster:
./preprocessing_codes/compute_graph_kernels.m ./preprocessing_codes/compute_graph_kernels.py