ON THE IMPOSSIBILITY OF ALL OVERLAPPING TRIPLET CODES

Nguyễn Ngọc Lương · Jun 25, 2006

Proceedings of the NATIONAL ACADEMY OF SCIENCES
Volume 43 - Number 8 - August 15, 1957
ON THE IMPOSSIBILITY OF ALL OVERLAPPING TRIPLET CODES IN INFORMATION TRANSFER FROM NUCLEIC ACID TO PROTEINS

BY S. BRENNER
MEDICAL RESEARCH COUNCIL UNIT FOR THE STUDY OF THE MOLECULAR STRUCTURE OF BIOLOGICAL SYSTEMS, CAVENDISH LABORATORY, CAMBRIDGE, ENGLAND
Communicated by G. Gamow, June 10, 1957

It is a generally accepted view that nucleic acids control the synthesis of proteins, and it has been proposed more specifically that the sequence of amino acids in a polypeptide chain is determined by the order of nucleotides in ribo- or deoxyribonucleic acid. The problem of how this determination is effected has come to be known as the "coding" problem. The formal aspects of this problem can be investigated theoretically, and most of the work done in this field has recently been reviewed by Gamow, Rich, and Yčas.1

Since there are only four different nucleotides in RNA or DNA to determine twenty different amino acids, it is clear that more than one nucleotide must be used to code for each amino acid. Most codes have been constructed on the basis that each amino acid is determined by a set of three nucleotides. Such triplet codes, however, have an excess of information, since there are sixty-four different triplets for the twenty amino acids. In Gamow's original diamond code, several triplets, chosen in a particular way, coded for any given amino acid; the code was therefore "degenerate." This code was also of the overlapping type; that is, the number of nucleotides in the nucleic acid was equal to the number of amino acids in the polypeptide chain. Gamow's diamond code does not, in fact, code for known sequences, and the same is true for the major-minor code, another overlapping triplet code, invented by L. Orgel.2 These are, however, only two examples of a large number of possible codes of this type which can be obtained by choosing different ways of degenerating the triplets. To test all of these systematically is clearly impossible, and hence it is necessary to have some general theorem about such codes.

The general overlapping triplet code has the following properties.
(i) The coding triplets are chosen from four nucleotides, A, T, C, and G, giving sixty-four different triplets.
(ii) Coding is overlapping, each triplet sharing two nucleotides with the succeeding triplet in a sequence. Thus the sequence ATCGA codes for three amino acids: ATC for the first, TCG for the second, and CGA for the third.
(iii) An amino acid may be represented by more than one triplet; that is, the sixty-four triplets are degenerated into twenty sets.
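Property (ii) can be illustrated with a minimal sketch. The function name `overlapping_triplets` is an assumption introduced here for illustration; it is not part of the paper.

```python
def overlapping_triplets(nucleotides: str) -> list[str]:
    """Return every window of three consecutive nucleotides.

    Successive windows share two nucleotides, as in property (ii).
    """
    return [nucleotides[i:i + 3] for i in range(len(nucleotides) - 2)]


# The paper's own example: ATCGA codes for three amino acids.
print(overlapping_triplets("ATCGA"))  # ['ATC', 'TCG', 'CGA']
```

A sequence of n nucleotides therefore yields n - 2 triplets, so for long chains the number of amino acids is essentially equal to the number of nucleotides.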

Since any dipeptide sequence is represented by a sequence of four nucleotides, there cannot be more than 256 different dipeptides. On the other hand, if all dipeptide sequences were possible, 400 would be expected. Thus overlapping codes introduce restrictions in amino acid sequences. The number of dipeptide sequences known is less than 256, and, although statistical studies have suggested that all dipeptides are likely to be found, the significance of this result has been difficult to assess.1 The sample of proteins studied is highly selected, a large number of sequences are fragmentary, and the methods used to study sequences further bias the data.
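The two counts in this paragraph can be checked with a short sketch. The variable names are assumptions chosen for illustration only.

```python
from itertools import product

NUCLEOTIDES = "ATCG"
N_AMINO_ACIDS = 20

# In an overlapping code a dipeptide is read from two triplets that share
# two nucleotides, i.e. from a run of four nucleotides.
four_nucleotide_runs = ["".join(p) for p in product(NUCLEOTIDES, repeat=4)]

max_coded_dipeptides = len(four_nucleotide_runs)  # 4**4
all_dipeptides = N_AMINO_ACIDS ** 2               # 20**2

print(max_coded_dipeptides, all_dipeptides)  # 256 400
```

Whatever way the sixty-four triplets are degenerated into twenty sets, at least 400 - 256 = 144 dipeptides can never appear under an overlapping code.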

However, sufficient sequences are known to prove that it is impossible to code them with overlapping triplets. The proof is simple and does not depend on any special way of degenerating the triplets. It consists in the demonstration that sixty-four triplets are insufficient to code the known sequences.

Proof: Since successive triplets share two nucleotides in common, any given triplet can be preceded by only four different triplets and succeeded by only four different triplets. In an amino acid sequence j.k.l., we call j an N-neighbor, and l a C-neighbor, of k. For every four different N-neighbors (or C-neighbors) or part thereof, k must have one triplet assigned to it. Thus the minimum number of triplet representations for each amino acid can be counted from a table of neighbors. The available sequences are given in the Appendix. From these sequences a grid is constructed and the different neighbors counted for each amino acid. The number of triplets assigned to each amino acid is based on the larger number of its neighbors. These data are given in Table 1, from which it can be seen that seventy triplets would be required to code the sequences. We conclude, then, that all overlapping triplet codes are impossible.
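The counting argument can be sketched as follows. This is a minimal illustration, not the paper's actual tabulation: the function name `min_triplets_per_residue` and the toy peptides are hypothetical, and the sketch ignores the Appendix conventions for glux and asx and for sequences shared between related proteins.

```python
from collections import defaultdict
from math import ceil


def min_triplets_per_residue(sequences):
    """Lower bound on the triplets each amino acid needs in an overlapping code.

    One triplet can be preceded (or followed) by at most four triplets, so it
    can serve at most four distinct N-neighbors (or C-neighbors).
    """
    n_neighbors = defaultdict(set)  # residues seen immediately before each residue
    c_neighbors = defaultdict(set)  # residues seen immediately after each residue
    for seq in sequences:
        for left, right in zip(seq, seq[1:]):
            c_neighbors[left].add(right)
            n_neighbors[right].add(left)

    residues = {r for seq in sequences for r in seq}
    needed = {}
    for r in residues:
        larger = max(len(n_neighbors[r]), len(c_neighbors[r]))
        needed[r] = max(1, ceil(larger / 4))  # every residue needs at least one triplet
    return needed


# Hypothetical toy data: "ala" has five distinct neighbors on each side.
toy_peptides = [
    ["gly", "ala", "ser"],
    ["val", "ala", "thr"],
    ["leu", "ala", "lys"],
    ["phe", "ala", "cys"],
    ["pro", "ala", "his"],
]
needed = min_triplets_per_residue(toy_peptides)
print(needed["ala"])         # 2
print(sum(needed.values()))  # 12
```

If the total over all amino acids exceeds sixty-four, no overlapping triplet code can produce the observed sequences; the paper reports a total of seventy for the sequences then known.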

This result has one important physical implication. The original formulation of overlapping codes was based on the similarity of the internucleotide distance in DNA to the spacing between amino acid residues in an extended polypeptide chain. It was supposed that each amino acid was spatially related in a one-to-one way with each nucleotide on a nucleic acid template. The present result shows that this cannot be so and that each amino acid is stereochemically related to at least two, if not three, nucleotides, depending on whether coding is partially overlapping or nonoverlapping. The difficulties raised by this can easily be overcome by assuming that the polypeptide sequence is in contact with the nucleic acid template only at the growing point, and detailed schemes can be readily proposed.

As far as the coding problem is concerned, it now appears that all amino acid sequences are likely to be found and that it will not be possible to effect a "decoding" by discovering restrictions in sequences. The nonoverlapping of triplets implies that there must be some way of determining which triplets in a sequence are coding triplets and which are not, and a very interesting code has recently been proposed by Crick, Griffith, and Orgel, in which this problem is dealt with in a novel manner.

Appendix

In writing the sequences, the same conventions used by Gamow et al.1 have been followed. Wherever doubt exists as to whether glutamic acid is present as such (glu) or as the amide (glun), it has been assigned as "glux," and the same rule has been followed for aspartic acid and asparagine. All the longer lysozyme sequences suggested by Thompson (Biochem. J., 60, 507; 61, 253, 1955) have been omitted, since some of these appear to be incorrect when compared with those established by the French workers. Sequences established by carboxypeptidase digestions alone are given at the end of the list but are omitted from the grid. The same applies to the pepsin sequence of Williamson and Passmann (J. Biol. Chem., 222, 151, 1956), as there are conflicting reports about the N-terminal group (Van Vunakis and Herriott, Biochim. et Biophys. Acta, 23, 60, 1957).

The grid (Table 2) shows the number of times dipeptide sequences are found. Identical sequences from the closely related proteins vasopressin and oxytocin and corticotrophin and melanophore-stimulating hormone are only recorded once. Dipeptide sequences from lysozyme are not recorded if the same sequence is found in a longer peptide. When both glu and glun are absent, glux is counted as a neighbor, and the same rule is followed for asp, asn, and asx.

Sequences used in the grid:

Insulin A: Gly. ileu. val. glu. glun. cys. cys. ala. ser. val'. cys. ser. leu. tyr. glun. leu. glu. asn. tyr. cys. asn

Insulin B: Phe. val. asn. glun. his. leu. cys. gly. ser. his. leu. val. glu. ala. leu. tyr. leu. val. cys. gly. glu. arg. gly. phe. phe. tyr. thr. pro. lys. ala.

Oxytocin: Cys. tyr. ileu. glun. asn. cys. pro. leu. gly.NH2.arga

Vasopressin: Cys. tyr. phe. glun. asn. cys. pro. lysb. gly. NH2.

Corticotrophin: Ser. tyr. ser. met. glu. his. phe. arg. try. gly. lys. pro. val. gly. lys. lys. arg. arg. pro. val. lys. val. tyr. pro. asp. gly. ala. glu.' asp. glun. leu. alab. glu. ala. phe. pro. leu. glux. ala. ser. glu. phe.

Glucagon: His. ser. glun. gly. thr. phe. thr. ser. asp. thr. ser. lys. tyr. leu. asp. ser. arg. arg. ala. glun. asp. phe. val. glun. try. leu. met. asn. thr.

Melanophore-stimulating hormone: Asp. glub. gly. pro. tyr. lys. met. glu. his. phe. arg. try. gly. ser. pro. pro. lys. asp.

Hypertensin: Asp. arg. val. tyr. vala. his. pro. phe. his. leu.

Cytochrome c: -val. glun. lys. cys. alaa, b, e, f. glun. cys. his. thr. val. glu. lys.

Trypsinogen: Val. asp. asp. asp. asp. lys. ileu. val. gly.

Ribonuclease: Lys. glu. thr. ala. ala. ala. lys. phe. glun. arg. glu-24-M -tyr. cys. asn. glun. met. met. lys. ser. arg. asn. leu. thr. lys. asp. arg. cys. lys. asn. val. ala. cys. lys. asn. thr...

etc. (there are too many to list them all)

Nguyễn Khiết Tâm · Jun 26, 2006

You labeled this as a reference exercise for molecular biology, but why don't I see any exercise in this post?
I read through it once and there are a few places I don't understand yet. I'll reread it carefully tomorrow and ask you then (ugh, so sleepy).

Nguyễn Ngọc Lương · Jun 26, 2006

Ah, the point is that one can devise an exercise for high school students from it. Read it carefully and we'll discuss it afterwards.

Nguyễn Ngọc Lương · Jun 26, 2006

High school students may be reluctant to read it because it is so long. Let me summarize:

By the time the structure of DNA was discovered, people also understood how it replicates. The remaining problem was how a gene codes for a protein. The first question was whether the gene itself is the direct template for protein synthesis. This question was settled by Crick. The second question was how exactly a gene codes for a protein. Many people worked on it, most notably the RNA Tie Club, which included Crick, Watson, Sydney Brenner, Gamow, and others. At that time people were already fairly confident that the genetic code was a triplet code (the term "codon" was coined by Sydney Brenner). They had also managed to establish that there are about 20 standard amino acids, and later learned about the rare ones ("freak" amino acids, which are really modified standard amino acids). By combinatorics, a triplet code gives 64 code words, more than enough to code for 20 amino acids, so the code must be degenerate. The question was whether the triplets are read separately or read overlapping one another:

For example, the sequence ATCG could be read as ATC followed by G plus whatever comes next, or as ATC and TCG. The latter is called overlapping reading. Because little was known about tRNA and mRNA at the time, people thought that a separate, nonoverlapping reading could not be right, since the amino acids would then be spaced too far apart (whereas in reality the distance between two amino acids is roughly the distance between two nucleotides). However, overlapping reading has some loopholes.
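The two ways of reading can be compared with a small sketch. The function `read_triplets` and its `step` parameter are assumptions made for illustration; a step of 1 gives the overlapping reading and a step of 3 the nonoverlapping one.

```python
def read_triplets(nucleotides: str, step: int) -> list[str]:
    """Read complete triplets, moving `step` nucleotides between reads.

    step=1 is the overlapping reading, step=3 the nonoverlapping one.
    An incomplete triplet at the end is dropped.
    """
    return [nucleotides[i:i + 3] for i in range(0, len(nucleotides) - 2, step)]


seq = "ATCGAT"
print(read_triplets(seq, step=1))  # ['ATC', 'TCG', 'CGA', 'GAT']
print(read_triplets(seq, step=3))  # ['ATC', 'GAT']
```

With the overlapping reading, each extra nucleotide adds one amino acid; with the nonoverlapping reading, three extra nucleotides are needed, which is the spacing concern described above.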

Sydney Brenner used a bit of ingenuity to refute the overlapping code hypothesis. High school students can do the same, and with a little ingenuity one could turn it into a molecular biology problem for "gifted students".
