Protein subchloroplast localization prediction: MultiP-SChlo

Michael Carrillo

MultiP-SChlo: multi-Label Protein Subchloroplast Localization Prediction with Chou's Pseudo Amino Acid Composition and a Novel Multi-Label Classifier

Xiao Wang, Weiwei Zhang, Qiuwen Zhang and Guo-Zheng Li

http://biomed.zzuli.edu.cn/bioinfo/multip-schlo/

Biological motivation

Plant cell

https://upload.wikimedia.org/wikipedia/commons/thumb/d/d8/Plant_cell_structure-en.svg/1280px-Plant_cell_structure-en.svg.png

Localization is important!

  1. Several incompatible metabolic reactions happen at the same time
  2. Proteins do not work alone; they are affected by their environment
  3. The cell must transport proteins to where they are "needed"

How does transportation occur?

Through small sequences in the protein that bind to receptors that guide them

These signals are neither static nor unique

Investigators need to know where a protein is located to understand its role in a given process

Biological experiments can provide evidence of these locations. However, they are expensive and often impractical

With the expansion of protein sequencing, new methods are needed to predict protein localization

Develop computational methods that allow investigators to predict these localizations

Computational context

Statistical classification

Machine Learning development

Identify to which of a set of categories a new observation belongs, based on a training set of observations whose category membership is known

Examples

  • Spam classification
  • Fraud detection
  • Protein subcellular localization prediction

  • Develop mathematical models that can predict categories for new observations
  • A training set that is already classified using other methods
  • Algorithms do not understand the nature of the problem; they find relations between inputs and outputs
  • Necessity of well distributed, abundant data
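The idea above (learn from labelled observations, then assign a category to a new one) can be sketched with a nearest-centroid classifier. This is only an illustration: the feature values and the "ham"/"spam" labels are made up, and the slides do not use this method.

```python
from collections import defaultdict
from math import dist


def train_centroids(points, labels):
    """Average the training points of each category into one centroid."""
    grouped = defaultdict(list)
    for point, label in zip(points, labels):
        grouped[label].append(point)
    return {
        label: tuple(sum(axis) / len(members) for axis in zip(*members))
        for label, members in grouped.items()
    }


def predict(centroids, point):
    """Assign a new observation to the category with the closest centroid."""
    return min(centroids, key=lambda label: dist(centroids[label], point))


# Toy training set (made-up numbers): two features per observation.
train_x = [(1.0, 1.2), (0.8, 1.0), (4.0, 4.2), (4.2, 3.9)]
train_y = ["ham", "ham", "spam", "spam"]

model = train_centroids(train_x, train_y)
print(predict(model, (0.9, 1.1)))
print(predict(model, (4.1, 4.0)))
```

The model never "knows" what spam is; it only relates input coordinates to the labels it was given.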

Linear regression

https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Linear_regression.svg/400px-Linear_regression.svg.png

  • A training set
  • A mathematical model
  • Training data is used to find parameters
  • Test how good the model is
  • Use it
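The five steps above can be sketched for a straight line fitted by ordinary least squares. The data points are made up for illustration, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Find slope and intercept of y = slope * x + intercept by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(slope, intercept, xs, ys):
    """Average squared difference between predictions and known values."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# 1. Training set (made-up points)   2. Model: a straight line
train_x, train_y = [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]
test_x, test_y = [4.0, 5.0], [9.0, 11.0]

# 3. Training data is used to find the parameters
slope, intercept = fit_line(train_x, train_y)

# 4. Test the model on points it has not seen
mse = mean_squared_error(slope, intercept, test_x, test_y)

# 5. Use it on a new input
prediction = slope * 10.0 + intercept
print(slope, intercept, mse, prediction)
```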

Models for classification

  • Support Vector Machines
  • Neural Networks
  • Decision Trees
  • Regression methods (Linear)
  • ...

Support Vector Machines (SVM)

SVM intuition

Characteristics of the Classifier

  • Non-probabilistic
  • Binary
  • Linear
  • Acts on points in space
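These four properties show up directly in the decision rule of a linear SVM: a point is classified by the side of a hyperplane it falls on. The sketch below uses hand-picked weights rather than trained ones (training is what a library such as LIBSVM does), so the numbers are assumptions for illustration only.

```python
from math import sqrt


def decision_value(weights, bias, point):
    """Signed score w . x + b; this is not a probability."""
    return sum(w * x for w, x in zip(weights, point)) + bias


def classify(weights, bias, point):
    """Binary output: +1 on one side of the hyperplane, -1 on the other."""
    return 1 if decision_value(weights, bias, point) >= 0 else -1


def distance_to_hyperplane(weights, bias, point):
    """Geometric distance from the point to the separating hyperplane."""
    norm = sqrt(sum(w * w for w in weights))
    return abs(decision_value(weights, bias, point)) / norm


# Hand-picked (not trained) hyperplane: x1 + x2 = 3
weights, bias = (1.0, 1.0), -3.0

print(classify(weights, bias, (3.0, 2.0)))
print(classify(weights, bias, (0.5, 1.0)))
print(distance_to_hyperplane(weights, bias, (3.0, 2.0)))
```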

Available implementation

LIBSVM

http://www.csie.ntu.edu.tw/~cjlin/libsvm/#nuandone

Why this tool?

  • Not the first tool available
  • More subchloroplast localizations
  • Older datasets were too biased
  • Older tools were not multi-label
  • Specifically constructed for chloroplasts
  • UniProtKB/Swiss-Prot has 14408 chloroplast proteins, only 6955 annotated (48%)

Dataset

UniProtKB/Swiss-Prot database (release 2013_05)

  1. Only proteins with annotated subchloroplast locations (single or multiple) were used
  2. Ambiguous annotations were removed ('by similarity', 'probable', 'potential')
  3. Fragments and proteins shorter than 50 amino acids were removed
  4. Proteins with ambiguous letters ('B', 'X', 'Z') were removed
  5. Reduction of redundancy and homology bias with CD-HIT

MSchlo578 was created with 578 chloroplast proteins
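Filtering steps 1 to 4 can be sketched as a simple predicate. The record layout (a sequence string, a list of location annotation strings, a fragment flag) is a hypothetical one, not the UniProtKB format, and the sketch assumes an ambiguous annotation is dropped rather than the whole protein. Step 5 (CD-HIT) is an external tool and is not reproduced here.

```python
AMBIGUOUS_RESIDUES = set("BXZ")
AMBIGUOUS_QUALIFIERS = ("by similarity", "probable", "potential")
MIN_LENGTH = 50


def clean_locations(locations):
    """Step 2: drop location annotations that carry an ambiguous qualifier."""
    return [
        loc for loc in locations
        if not any(q in loc.lower() for q in AMBIGUOUS_QUALIFIERS)
    ]


def keep_protein(sequence, locations, is_fragment=False):
    """Return True if the protein passes filtering steps 1 to 4."""
    # Step 3: no fragments, no proteins shorter than 50 amino acids
    if is_fragment or len(sequence) < MIN_LENGTH:
        return False
    # Step 4: no ambiguous residue letters
    if AMBIGUOUS_RESIDUES & set(sequence.upper()):
        return False
    # Steps 1 and 2: at least one unambiguous location must remain
    return bool(clean_locations(locations))


# Illustrative records only (not real database entries)
print(keep_protein("A" * 60, ["stroma"]))
print(keep_protein("A" * 49, ["stroma"]))
print(keep_protein("A" * 59 + "X", ["stroma"]))
print(keep_protein("A" * 60, ["stroma (by similarity)"]))
```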

Dataset

Labels and distribution

Feature extraction

Let's use the protein sequence

Raw sequences are not standardized (they vary in length), so they are not directly usable for ML

Amino Acid Composition (AAC)

A vector of 20 components, where each component is the frequency of one amino acid

Problem?

Ordering information is lost
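A minimal sketch of AAC, which also shows the problem: two sequences with the same residues in a different order get the same vector. The alphabet ordering and the short example sequences are arbitrary choices for illustration.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aac(sequence):
    """Return the 20-component amino acid composition of a sequence."""
    counts = Counter(sequence.upper())
    return [counts[aa] / len(sequence) for aa in AMINO_ACIDS]


original = aac("MKVLA")
reversed_order = aac("ALVKM")

print(len(original))
print(original == reversed_order)
```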

Pseudo Amino Acid Composition (PseAAC)

The first 20 components are the AAC

Added ξ * λ components

In this case:

  • λ is 50
  • ξ is 6

Total of 20 + (50 * 6) = 320 components

λ must be less than the length of the shortest sequence (50)

ξ can be a subset of the following chemical properties:

  • Hydrophobicity
  • Hydrophilicity
  • Mass
  • pK (alpha-COOH)
  • pK (NH3)
  • pI (at 25 °C)
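A hedged sketch of how a 20 + ξ * λ vector can be built, assuming the series-correlation form of PseAAC: for each lag k from 1 to λ and each property, average the product of the standardized property values of residues k positions apart. The weight of 0.05 is an assumption, and the Kyte-Doolittle hydropathy scale is used only as a stand-in property table; the paper's exact tables and formula may differ.

```python
from collections import Counter
from statistics import mean, pstdev

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy, a stand-in for the hydrophobicity property.
HYDROPATHY = {
    "A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8,
    "G": -0.4, "H": -3.2, "I": 4.5, "K": -3.9, "L": 3.8,
    "M": 1.9, "N": -3.5, "P": -1.6, "Q": -3.5, "R": -4.5,
    "S": -0.8, "T": -0.7, "V": 4.2, "W": -0.9, "Y": -1.3,
}


def standardize(table):
    """Rescale a property table to zero mean and unit standard deviation."""
    mu, sigma = mean(table.values()), pstdev(table.values())
    return {aa: (value - mu) / sigma for aa, value in table.items()}


def pse_aac(sequence, property_tables, lam, weight=0.05):
    """Return 20 + len(property_tables) * lam components for one sequence."""
    if lam >= len(sequence):
        raise ValueError("lambda must be smaller than the sequence length")
    counts = Counter(sequence)
    freqs = [counts[aa] / len(sequence) for aa in AMINO_ACIDS]
    tables = [standardize(t) for t in property_tables]
    taus = []
    for k in range(1, lam + 1):      # sequence-order lag
        for table in tables:         # one correlation factor per property
            products = [
                table[sequence[i]] * table[sequence[i + k]]
                for i in range(len(sequence) - k)
            ]
            taus.append(mean(products))
    denom = sum(freqs) + weight * sum(taus)
    return [f / denom for f in freqs] + [weight * t / denom for t in taus]


vec = pse_aac("MKVLAAGIVGLC", [HYDROPATHY], lam=3)
print(len(vec))
```

With one property and λ = 3 the vector has 20 + 1 * 3 = 23 components; with six property tables and λ = 50 the same function would give the 320 components quoted on the slide.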

PseAAC representation

Proposed classifier

SVM-based multiple classifier

SVM matrix

N = 578, M = 5
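With N proteins and M possible locations, the targets of a multi-label problem can be stored as an N x M matrix of 0/1 entries. One common way to reuse binary SVMs is to train one classifier per column (binary relevance). The sketch below only builds the matrix and the per-label targets; it is not necessarily the decomposition used by the paper's classifier, and the label names are placeholders.

```python
def label_matrix(label_sets, labels):
    """Build an N x M 0/1 matrix: row i marks the locations of protein i."""
    return [
        [1 if label in chosen else 0 for label in labels]
        for chosen in label_sets
    ]


def binary_problems(matrix, labels):
    """Split the matrix by column: one binary target vector per label."""
    return {
        label: [row[j] for row in matrix]
        for j, label in enumerate(labels)
    }


# Placeholder label names (M = 5) and three toy proteins
labels = ["loc_1", "loc_2", "loc_3", "loc_4", "loc_5"]
label_sets = [{"loc_1"}, {"loc_2", "loc_4"}, {"loc_5"}]

matrix = label_matrix(label_sets, labels)
problems = binary_problems(matrix, labels)
print(matrix)
print(problems["loc_4"])
```

For the real dataset the matrix would have 578 rows and 5 columns, and a protein with several locations has more than one 1 in its row.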

How good is this classifier?

Jackknife test
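The jackknife test is leave-one-out validation: each protein is held out in turn, the model is trained on the remaining N - 1, and the held-out protein is predicted. A minimal sketch follows, with a 1-nearest-neighbour rule standing in for the real classifier and made-up points as data.

```python
from math import dist


def jackknife(samples, targets, fit, predict):
    """Predict every sample with a model trained on all the other samples."""
    predictions = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = targets[:i] + targets[i + 1:]
        model = fit(train_x, train_y)
        predictions.append(predict(model, samples[i]))
    return predictions


def fit_1nn(xs, ys):
    """Stand-in classifier: 1-nearest-neighbour just stores the training set."""
    return list(zip(xs, ys))


def predict_1nn(model, point):
    """Return the label of the closest stored training point."""
    return min(model, key=lambda pair: dist(pair[0], point))[1]


# Made-up two-feature data with two well separated groups
samples = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
targets = ["a", "a", "a", "b", "b", "b"]

predicted = jackknife(samples, targets, fit_1nn, predict_1nn)
accuracy = sum(p == t for p, t in zip(predicted, targets)) / len(targets)
print(predicted, accuracy)
```

Every sample is tested exactly once and never appears in its own training set, so the result does not depend on a random split.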

Yi is the set of true labels for example i; Zi is the predicted label set

Evaluation formulas
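The slide's formulas are not reproduced in the text, so the sketch below uses commonly used example-based multi-label metrics built from Yi and Zi (set overlap averaged over the N examples). They may differ from the exact definitions shown on the slide; the label names and predictions are made up.

```python
def multilabel_scores(true_sets, predicted_sets):
    """Average set-overlap metrics over all (Yi, Zi) pairs."""
    n = len(true_sets)
    accuracy = precision = recall = 0.0
    exact = 0
    for y, z in zip(true_sets, predicted_sets):
        inter, union = len(y & z), len(y | z)
        accuracy += inter / union if union else 1.0   # |Yi ∩ Zi| / |Yi ∪ Zi|
        precision += inter / len(z) if z else 0.0     # |Yi ∩ Zi| / |Zi|
        recall += inter / len(y) if y else 0.0        # |Yi ∩ Zi| / |Yi|
        exact += y == z                               # all labels correct
    return {
        "accuracy": accuracy / n,
        "precision": precision / n,
        "recall": recall / n,
        "exact_match": exact / n,
    }


# Toy example: three proteins, true label sets Y and predicted sets Z
Y = [{"a"}, {"a", "b"}, {"c"}]
Z = [{"a"}, {"a"}, {"b", "c"}]

scores = multilabel_scores(Y, Z)
print(scores)
```

A partially correct prediction (second and third protein) still earns partial credit in accuracy, precision, and recall, but counts as a miss for the exact match.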

Compared to AL-KNN

Compared to single label classifiers

Compared to SLC

Questions?