Notebook

Testing ChatGPT Python code¶

dictionary and protein sequence length¶

ChatGPT – a generative pre-trained transformer (GPT) – was fine-tuned on top of GPT-3.5 using supervised learning as well as reinforcement learning. Both approaches used human trainers to improve the model's performance. (Wikipedia - ChatGPT)

I asked ChatGPT which version it is running, since its answers may vary in the future. This is what I could obtain:

Q: what is the current version of ChatGPT?
A: I am ChatGPT, an advanced language model developed by OpenAI. As of my knowledge cut-off date of 2021, the most recent version of ChatGPT is OpenAI's GPT-3 model, which was released in June 2020.
Q: Are you not at version 3.5?
A: I am not sure about the exact version number, but I can tell you that I am part of OpenAI's GPT-3 language model. OpenAI doesn't specify version numbers for their models, so it's possible that someone has referred to it as "GPT-3.5", but officially it is just referred to as GPT-3.

Q: Could you write a short explanation of the dictionary structure of the Python 3 programming language suitable for biology students?

ChatGPT: here's a simple script that contains a dictionary of amino acids and uses it to analyze the composition of a protein sequence. In this case, the protein sequence is the short sequence of insulin. We'll use the one-letter code amino acids:

In this notebook the output is shown below.

In [1]:

amino_acids = {'A': 'Alanine', 'C': 'Cysteine', 'D': 'Aspartic Acid', 'E': 'Glutamic Acid',
               'F': 'Phenylalanine', 'G': 'Glycine', 'H': 'Histidine', 'I': 'Isoleucine',
               'K': 'Lysine', 'L': 'Leucine', 'M': 'Methionine', 'N': 'Asparagine',
               'P': 'Proline', 'Q': 'Glutamine', 'R': 'Arginine', 'S': 'Serine',
               'T': 'Threonine', 'V': 'Valine', 'W': 'Tryptophan', 'Y': 'Tyrosine'}

protein_sequence = "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN"

aa_count = {}
for aa in amino_acids:
    aa_count[aa] = protein_sequence.count(aa)

print("Amino Acid Composition of Insulin:")
for aa, count in aa_count.items():
    print(f"{amino_acids[aa]}: {count}")

Amino Acid Composition of Insulin:
Alanine: 10
Cysteine: 6
Aspartic Acid: 2
Glutamic Acid: 8
Phenylalanine: 3
Glycine: 12
Histidine: 2
Isoleucine: 2
Lysine: 2
Leucine: 20
Methionine: 2
Asparagine: 3
Proline: 6
Glutamine: 7
Arginine: 5
Serine: 5
Threonine: 3
Valine: 6
Tryptophan: 2
Tyrosine: 4

This script first creates a dictionary of amino acids and their full names, using the one-letter code as the key. Then, it defines the protein sequence as a string. The script then uses a for loop to count the occurrences of each amino acid in the sequence, and stores the count in a new dictionary aa_count. Finally, the script prints the composition of the protein by looping through the aa_count dictionary and printing the full name of the amino acid and its count, as specified in the amino_acids dictionary.

The output of this script will be: (JYS Note: the numbers below are NOT those of the insulin sequence, just an example. See below for confirmation that the code is correct.)

Amino Acid Composition of Insulin:
Alanine: 8
Cysteine: 2
Aspartic Acid: 2
Glutamic Acid: 4
Phenylalanine: 1
Glycine: 7
Histidine: 2
Isoleucine: 7
Lysine: 5
Leucine: 6
Methionine: 1
Asparagine: 4
Proline: 11
Glutamine: 4
Arginine: 6
Serine: 6
Threonine: 4
Valine: 5
Tryptophan: 0
Tyrosine: 2

NOTE: the output proposed by ChatGPT shows wrong numbers, but the results from the Python code (above) appear to be correct. For example, the last two entries (tryptophan and tyrosine) are 0 and 2 in the example output but 2 and 4 in the actual output above.
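A quick additional sanity check, as my own minimal sketch and not part of the ChatGPT answer: the 20 counts should add up to the length of the sequence. The name `chatgpt_example` is hypothetical and simply holds the numbers copied from the example output above.

```python
# Sanity check (sketch): amino acid counts must sum to the sequence length.
protein_sequence = "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN"

# The 20 one-letter codes, in the same order as the amino_acids dictionary.
one_letter_codes = "ACDEFGHIKLMNPQRSTVWY"

# Counts computed from the sequence, as in the notebook cell above.
aa_count = {aa: protein_sequence.count(aa) for aa in one_letter_codes}

# Numbers copied from the example output proposed by ChatGPT.
chatgpt_example = dict(zip(
    one_letter_codes,
    [8, 2, 2, 4, 1, 7, 2, 7, 5, 6, 1, 4, 11, 4, 6, 6, 4, 5, 0, 2],
))

print(len(protein_sequence))           # 110
print(sum(aa_count.values()))          # 110
print(sum(chatgpt_example.values()))   # 87
```

The computed counts account for every residue, while the example numbers do not, which is consistent with the example output being illustrative only.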

We can verify that the computation is correct with a separate method, for example a bash script such as the one found at this link: Number of occurences of letters in a word?.

Since no amino acid is missing from the sequence, no amino acid has a zero count. Lines with a zero count therefore correspond to letters of the alphabet that do not match any amino acid, and they are removed by the final grep: -v inverts the match, so lines matching the pattern 0 are dropped, while -w restricts the match to whole words, i.e. a standalone 0. Lines with counts of 20 or 10 are therefore not removed.
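The whole-word filtering described above can be illustrated in Python. This is only a sketch of the idea, assuming that the regular expression `\b0\b` is a fair stand-in for what `grep -vw 0` does; the sample lines are made up for the illustration.

```python
import re

# Made-up sample lines in the same "letter : count" shape as the bash output.
lines = ["A : 10", "B : 0", "L : 20", "Z : 0"]

# Keep only lines that do NOT contain 0 as a whole word
# (rough analogue of `grep -vw 0`).
kept = [line for line in lines if not re.search(r"\b0\b", line)]

print(kept)  # ['A : 10', 'L : 20']
```

The zeros inside 10 and 20 are not standalone words, so those lines survive the filter.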

Note: The code below is sent to run in bash by ! and will not run natively on Windows systems.

In [3]:

# To run this in the shell we add `!`

! protein_sequence="MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN"; for i in {A..Z}; do v=$(echo -n "$protein_sequence" | grep -oi $i | wc -l); echo "$i : $v" | grep -vw 0;  done

A :       10
C :        6
D :        2
E :        8
F :        3
G :       12
H :        2
I :        2
K :        2
L :       20
M :        2
N :        3
P :        6
Q :        7
R :        5
S :        5
T :        3
V :        6
W :        2
Y :        4

While the output is formatted slightly differently from that of the Python code (one-letter codes instead of full names), the numbers match! Therefore the Python code works correctly.
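For readers who cannot run bash (for example on Windows), a similar independent check can be sketched in Python itself with `collections.Counter` from the standard library. This is my own suggestion and not something ChatGPT proposed; it counts whatever letters occur in the string instead of looping over a predefined dictionary.

```python
from collections import Counter

protein_sequence = "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN"

# Counter tallies every character found in the string in one pass.
counts = Counter(protein_sequence)

# Print in alphabetical order to ease comparison with the outputs above.
for letter in sorted(counts):
    print(f"{letter}: {counts[letter]}")
```

Because this approach does not depend on the amino_acids dictionary, any unexpected letter in the sequence would also show up in the output.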

In [14]:

# For easier reading,
# the same bash code can be re-written
# with the line-continuation symbol: \
# Even the protein sequence can be split
# across multiple lines.

# Here we also change {A..Z} back to {a..z},
# as in the original script, since the
# grep command is case insensitive thanks
# to the `i` in `-oi`.

# To run this in the shell we add `!`

! protein_sequence="MALWMRLLPLLALLALWGPD\
PAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDL\
QVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN"; \
for i in {a..z}; \
do v=$(echo -n "$protein_sequence" | grep -oi $i | wc -l); \
echo "$i : $v" | grep -vw 0;  \
done

a :       10
c :        6
d :        2
e :        8
f :        3
g :       12
h :        2
i :        2
k :        2
l :       20
m :        2
n :        3
p :        6
q :        7
r :        5
s :        5
t :        3
v :        6
w :        2
y :        4

Alternate shell command. In the shell command above, I added the final grep to remove lines with zero counts, since no amino acid is missing. That loop is best suited to counting the occurrences of every letter of the alphabet.

The other suggested command only takes into account the letters actually present in the input (here the protein sequence) and is therefore better suited to this case.

In [6]:

!echo -n "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN" \
| sed 's/\(.\)/\1\n/g'| sort | uniq -c | awk '{print $2": "$1}'

A: 10
C: 6
D: 2
E: 8
F: 3
G: 12
H: 2
I: 2
K: 2
L: 20
M: 2
N: 3
P: 6
Q: 7
R: 5
S: 5
T: 3
V: 6
W: 2
Y: 4

The command can be understood as:

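echo -n: prints the provided sequence. -n suppresses the trailing newline character.
sed: the stream editor replaces each character with itself followed by a newline, so that every character ends up on its own line. See ChatGPT explanation below.
sort: orders the lines alphabetically so that identical letters become adjacent, which the counting step requires.
uniq -c: collapses adjacent identical lines into one, and -c prefixes each with its number of occurrences.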
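The steps of the pipeline can be mirrored in Python. This is only a rough sketch, assuming that `itertools.groupby` is a reasonable analogue of `uniq -c`: both only merge adjacent identical items, which is why the sort step is needed first. A short made-up prefix of the sequence is used to keep the output small.

```python
from itertools import groupby

text = "MALWMRLL"  # first 8 letters of the sequence, for illustration only

# Analogue of the sed step: one character per "line".
chars = list(text)

# Analogue of `sort`: identical letters become adjacent.
chars.sort()

# Analogue of `uniq -c`: collapse adjacent runs and count their length.
composition = [(letter, len(list(run))) for letter, run in groupby(chars)]

for letter, count in composition:
    print(f"{letter}: {count}")
# A: 1
# L: 3
# M: 2
# R: 1
# W: 1
```

Without the sort, the two M letters would be counted as two separate runs of one.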

I added the awk command to reorder the output columns and add a colon :. The alternate command awk '{print $2":\t"$1}' would make the output tab-delimited, while '{print $2","$1}' would replace the colon with a comma and make the output comma-separated; that output could be sent to a file by adding > composition.csv at the very end, for example.
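The comma-separated variant can also be sketched in Python with the standard `csv` module. The helper name `write_composition` is hypothetical, and the example writes to an in-memory buffer so that it can be run without creating a file.

```python
import csv
import io
from collections import Counter


def write_composition(sequence, handle):
    """Write one 'letter,count' row per distinct letter, in alphabetical order."""
    counts = Counter(sequence)
    writer = csv.writer(handle)
    for letter in sorted(counts):
        writer.writerow([letter, counts[letter]])


protein_sequence = "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN"

# Write to an in-memory buffer and show the result.
buffer = io.StringIO()
write_composition(protein_sequence, buffer)
print(buffer.getvalue())

# To write an actual file instead (same file name as in the text above):
# with open("composition.csv", "w", newline="") as handle:
#     write_composition(protein_sequence, handle)
```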

ChatGPT explanation of the `sed` command:¶

[...] sed is a command-line tool used for text manipulation.

The regular expression used in the command is: \(.\) which captures any single character and saves it in a group. The \1 in the replacement pattern refers to this captured group.

The s in sed stands for "substitute". The command is replacing every captured character in the input text with that same character followed by a newline character (\n), effectively inserting a new line after every character.

So, in plain English, the sed command you provided is taking some text as input and adding a new line after every character in the text.
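The same substitution can be tried out in Python with `re.sub`, as a small sketch of what the sed expression does. Note that Python writes the capture group as `(.)` without backslashes, unlike the basic regular expression syntax used by sed.

```python
import re

text = "MALW"  # short made-up input for illustration

# Replace every single character with itself followed by a newline.
result = re.sub(r"(.)", r"\1\n", text)

print(repr(result))  # 'M\nA\nL\nW\n'
```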

Insulin sequence¶

Note: ChatGPT correctly identified the sequence of human insulin, which can be found under the NCBI accession code P01308: https://www.ncbi.nlm.nih.gov/protein/P01308

- End of Notebook -¶
