Word decoding of protein amino acid sequences with availability analysis a linguistic approach

Publication

AuthorsMotomura, Kenta Fujita, Tomohiro Tsutsumi, Motosuke Kikuzato, Satsuki Nakamura, Morikazu Otaki, Joji M.

Year2012

Abstract

The amino acid sequences of proteins determine their three-dimensional structures and functions. However, how sequence information is related to structures and functions is still enigmatic. In this study, we show that at least a part of the sequence information can be extracted by treating amino acid sequences of proteins as a collection of English words, based on a working hypothesis that amino acid sequences of proteins are composed of short constituent amino acid sequences (SCSs) or "words". We first confirmed that the English language highly likely follows Zipf's law, a special case of power law. We found that the rank-frequency plot of SCSs in proteins exhibits a similar distribution when low-rank tails are excluded. In comparison with natural English and "compressed" English without spaces between words, amino acid sequences of proteins show larger linear ranges and smaller exponents with heavier low-rank tails, demonstrating that the SCS distribution in proteins is largely scal)