From 50cfeb7cc6124c37ecdbb957dd58fe553e5b462e Mon Sep 17 00:00:00 2001
From: Jouni Kettunen <jojuket@utu.fi>
Date: Tue, 6 Sep 2022 18:00:38 +0000
Subject: [PATCH] newest version of the project

---
 bioinformaticsProjectKettunen_main.ipynb | 574 +++++++++++++++++++++++
 1 file changed, 574 insertions(+)
 create mode 100644 bioinformaticsProjectKettunen_main.ipynb

diff --git a/bioinformaticsProjectKettunen_main.ipynb b/bioinformaticsProjectKettunen_main.ipynb
new file mode 100644
index 0000000..a76bb74
--- /dev/null
+++ b/bioinformaticsProjectKettunen_main.ipynb
@@ -0,0 +1,574 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# <center> Analysis of the presence of cysteines in mitochondrial proteins </center>\n",
+    "\n",
+    "## <center> Bioinformatics project fall/spring 2021: </center>\n",
+    "\n",
+    "### <center> Jouni Kettunen </center>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Research questions:\n",
+    "<b> Question 1: </b> Could cysteines have a specific function in mitochondrial (MT) proteins, even if they do not form disulfides in the reducing environment inside the mitochondrion?\n",
+    "\n",
+    "<b> Question 2: </b> Is the amino acid frequency of cysteine in MT proteins different from its frequency in cytoplasmic or extracellular proteins?\n",
+    "\n",
+    "<b> Question 3: </b> Other than the locations in Q2, what other location-describing GO terms exist?\n",
+    "\n",
+    "<b> Question 4: </b> Find non-overlapping GO terms for proteins.\n",
+    "\n",
+    "<b> Question 5: </b> Find and show the distribution for every Protein Existence code."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### <center> Intro </center>\n",
+    "\n",
+    "<b> Determining the annotations </b>\n",
+    "\n",
+    "Protein descriptions include annotations called Gene Ontology annotations - GO terms - which describe a protein on three main levels: biological process, molecular function and cellular component. Each of the three main levels includes several sub-levels, which are divided into further sub-levels, and so forth.\n",
+    "\n",
+    "Cellular component describes - as the name indicates - where in the cell the protein is located. Since the goal is to identify proteins based on their location, the other two main levels could be set aside. The search could then be refined within the cellular component level using the keywords “mitochondrial matrix” for mitochondrial proteins and “cytoplasm” for proteins located in the cytoplasm. Finding secreted proteins required the knowledge that they are exported outside the cell after synthesis; this means that the keywords “extracellular region” and “extracellular space” are relevant for searching secreted proteins.\n",
+    "\n",
+    "Conducting a QuickGO search with these keywords yielded the following terms to be used for the classification:\n",
+    "- Mitochondrial matrix: 'GO:0005759' or 'GO:0031980'\n",
+    "- Cytosol: 'GO:0005829'\n",
+    "- Secreted: 'GO:0005576' or 'GO:0005615'\n",
+    "\n",
+    "\n",
+    "<b> Finding the proteome </b>\n",
+    "\n",
+    "The classification was meant to be done proteome-wide, meaning that one needs to find and access the whole human proteome. Information on the human proteome comes from a variety of sources (e.g. Human Proteome Map, ProteomicsDB and the Human Proteome Project (HPP)) aiming to map the whole human proteome. The Human Proteome Organization (HUPO) announced in 2020 that the human proteome is 90.4% complete. HUPO doesn’t host the human proteome on its site, meaning that some other resource is needed to handle the data.\n",
+    "\n",
+    "To find the data one needs to access the Universal Protein Resource (UniProt). UniProt is a database designed to hold protein sequence and annotation data and to give researchers access to that data. Access can be gained by downloading curated datasets directly for personal use or by retrieving the desired parts programmatically. UniProt holds and maintains a dataset that covers the whole human proteome (https://www.uniprot.org/proteomes/UP000005640); this dataset can be downloaded either in FASTA format (holding only the sequences) or in XML (holding the metadata, including GO terms). Another way to access a compressed version of the human proteome is the FTP site, which is updated every eight weeks (https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/UP000005640/).\n",
+    "\n"
+   ]
+  },
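+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "For illustration, a local copy of the proteome XML could be fetched programmatically along the lines of the sketch below. This is a hypothetical example: the exact file name on the FTP site is an assumption and should be checked against the directory listing above; this project worked from a pre-downloaded local copy.\n",
+    "\n",
+    "```python\n",
+    "import os\n",
+    "import wget\n",
+    "\n",
+    "# Hypothetical file name -- verify against the FTP directory listing.\n",
+    "url = ('https://ftp.uniprot.org/pub/databases/uniprot/current_release/'\n",
+    "       'knowledgebase/reference_proteomes/Eukaryota/UP000005640/'\n",
+    "       'UP000005640_9606.xml.gz')\n",
+    "\n",
+    "# Download only if no local copy exists yet.\n",
+    "if not os.path.isfile('UP000005640_9606.xml.gz'):\n",
+    "    wget.download(url, 'UP000005640_9606.xml.gz')\n",
+    "```"
+   ]
+  },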
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### <center> Materials & Methods </center>\n",
+    "\n",
+    "<b> Accessing the human proteome (Biopython)</b>\n",
+    "\n",
+    "Accessing the GO terms requires a programmatic approach: the files are too big for Excel and other XML readers, the FASTA files don't include GO terms, and a manual search would be prohibitively laborious. For the purposes of this project the Python library Biopython was selected. Biopython makes it possible to access and manipulate the UniProt files by parsing them in a human-friendly manner. Biopython also includes several ready-made modules for biological computation; one of them, the Bio.SeqUtils.ProtParam module, was used for counting amino acid frequencies.\n",
+    "\n",
+    "Biopython doesn't have a GO library (OBO parser), but one can be installed with pip as a locally run library. This project uses GOAtools by Tang et al., https://github.com/tanghaibao/goatools. A GO library can be used for \"decrypting\" GO terms, but for this project the GO terms were taken directly from QuickGO, so they didn't require any identification. Terms were collected with string matching, where a parsed string needs to match the defined string (e.g. item == 'GO:0031980') and evaluate to the boolean condition <b> TRUE </b> to be taken into account. The GO module was used to visualize relationships between terms.\n",
+    "\n",
+    "<b> Accessing the datafile </b>\n",
+    "\n",
+    "Biopython allows the user to access files directly over the internet or to read/parse files locally. Accessing via the internet requires the user to determine the names of the proteins, which in the case of this project would have doubled the work. The human proteome is also a relatively large dataset, and online searches might have taken a long time. For these reasons, and because a local copy allows the creation of a training set, a local approach was chosen. As the GO searches were to be done locally, XML was selected as the format for the file download.\n",
+    "\n",
+    "An XML file has a tree-like format in which a main body contains several branches that branch further and carry the desired values as leaves. To answer the research questions, the following values needed to be extracted from the XML file: name of the protein, location described as a GO term, amino acid sequence, UniProt ID and protein existence level, as sketched below.\n",
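+    "\n",
+    "A heavily abbreviated, hypothetical sketch of a single UniProt XML entry (element names follow the UniProt XML schema; the accession and sequence here are placeholders, and most metadata is omitted):\n",
+    "\n",
+    "```xml\n",
+    "<entry>\n",
+    "  <accession>PXXXXX</accession>                        <!-- UniProt ID -->\n",
+    "  <name>EXAMPLE_HUMAN</name>                           <!-- protein name -->\n",
+    "  <proteinExistence type=\"evidence at protein level\"/> <!-- PE level -->\n",
+    "  <dbReference type=\"GO\" id=\"GO:0005759\">              <!-- location -->\n",
+    "    <property type=\"term\" value=\"C:mitochondrial matrix\"/>\n",
+    "  </dbReference>\n",
+    "  <sequence>MSTV...</sequence>                         <!-- AA sequence -->\n",
+    "</entry>\n",
+    "```\n",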
\n", + "\n", + "<b> Storing data </b>\n", + "\n", + "To meet the data-analysis needs in Python, a library named Pandas has been built making possible to build, manipulate and analyse data structures. This project uses specially Pandas series for saving parsed data and data frames that are build from the saved series. The XML-file is as described like an onion and data extraction like peeling the onion; each time that a layer matching to condition was met, the described values were ectracted from it and saved as a series object to a Python list. Once the onion peeling was completed (ie. no more proteins that meet the condition), the list holding Pandas series objects was transformed to a Pandas data frame. Pandas also enables effective sorting and filtering of the data frame.\n", + "\n", + "As Python keeps data in memory only for the ongoing session, data frame was saved for safety reasons and for future purposes so that there wouldn't be need to do GO-search everytime the project file is opened. A compressed binary formatted parquet -file was chosen for its abilites to pack data compactly, be a robust storing/sharing format and 100 % Pandas compatible when read to Python. A downside is that the file isn't human readable and Python doesn't have inbuilt extension for Parquet; one must install a Pyarrow library to hande parquet files.\n", + "\n", + "<b> Visualizing data </b>\n", + "\n", + "Pandas dataframe enables filtering to conveniently analyze and visualize data by using libraries. The chosen visualization libraries - seaborn and pyplot - are based on matplotlib and allow flexible and versatile function for visualizing data (eg. frequencies).\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### <center> Results </center>" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### <center> Modules for the project </center>" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "# To make sure that project runs properly, determine the location of the Python packages.\n", + "import sys\n", + "\n", + "# Use the location of Python.\n", + "sys.path.append('C:\\\\Python\\\\lib\\\\site-packages')" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "# Biopython, input and output assorted sequence file formats for handling SeqRecord objects \n", + "import Bio.SeqIO as BSqIO\n", + "\n", + "# Biopython protein analysis module\n", + "from Bio.SeqUtils.ProtParam import ProteinAnalysis\n", + "\n", + "# non-interactive utility to download remote files from the internet (GO-parser)\n", + "import wget\n", + "\n", + "# for using operating system dependent functionality (GO-parser, file existence)\n", + "import os\n", + "\n", + "# Explain better why this module\n", + "import requests as R\n", + "\n", + "# creating series and data frames\n", + "import pandas as pd\n", + "\n", + "# data visualization\n", + "import seaborn as sns\n", + "\n", + "# data visualization\n", + "import matplotlib.pyplot as plt\n", + "\n", + "# data handling\n", + "import numpy as np\n", + "\n", + "# data handling\n", + "import scipy.stats as stats\n", + "\n", + "# to open gz packed files\n", + "import gzip\n", + "\n", + "# for packagin the df's\n", + "import pyarrow" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### <center> GO-parser" + ] + }, + { + "cell_type": "code", + "execution_count": 47, + "metadata": {}, + "outputs": [], + 
"source": [ + "# GOAtools is used in this project for handling the obo file\n", + "# for GOAtools use - pip install goatools - to install tols locally run library\n", + "# for more info see https://github.com/tanghaibao/goatools\n", + "\n", + "go_obo_url = 'http://purl.obolibrary.org/obo/go/go-basic.obo'\n", + "data_folder = os.getcwd() + '/data'\n", + "\n", + "# makes a local data folder\n", + "\n", + "# Check if we have the ./data directory already\n", + "if(not os.path.isfile(data_folder)):\n", + " # Emulate mkdir -p (no error if folder exists)\n", + " try:\n", + " os.mkdir(data_folder)\n", + " except OSError as e:\n", + " if(e.errno != 17):\n", + " raise e\n", + "else:\n", + " raise Exception('Data path (' + data_folder + ') exists as a file. '\n", + " 'Please rename, remove or change the desired location of the data path.')\n", + "\n", + "# Check if the file exists already\n", + "if(not os.path.isfile(data_folder+'/go-basic.obo')):\n", + " go_obo = wget.download(go_obo_url, data_folder+'/go-basic.obo')\n", + "else:\n", + " go_obo = data_folder+'/go-basic.obo'" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Import the OBO parser from GOATools\n", + "from goatools import obo_parser\n", + "# create a parser\n", + "go = obo_parser.GODag(go_obo)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### <center> GO-labels" + ] + }, + { + "cell_type": "code", + "execution_count": 51, + "metadata": {}, + "outputs": [], + "source": [ + "# 21 GO-terms of interest to exclude the rest\n", + "# 'GO:0005576', \"GO:0005829\", 'GO:0005759', 'GO:0031980' to the BP\n", + "\n", + "goterms = [\"GO:0032588\", \"GO:0031965\", \"GO:0031966\", \"GO:0005789\", \"GO:0012506\", \"GO:0000139\",\n", + " \"GO:0032580\",\"GO:0005802\", \"GO:0005796\", \"GO:0043202\", \"GO:0005765\", \"GO:0005778\",\n", + " \"GO:0005777\", \"GO:0005741\", \"GO:0005743\", 'GO:0005576', \"GO:0005829\", 'GO:0005759', \n", + " 'GO:0031980', 'GO:0005634', 'GO:0005801']\n", + "\n", + "# dictionary to re-code the proteinExistence labels, corresbond to the Uniprot labels\n", + "prex = {\"evidence at protein level\": 1, \"evidence at transcript level\": 2, \"inferred from homology\": 3, \"predicted\": 4, \"uncertain\": 5 }\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### <center> Dataframe construction" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "# empty list to hold the series objects\n", + "interests = []\n", + "\n", + "# open the gzipped human proteome\n", + "with gzip.open(\"UP000005640reviewed.xml.gz\", \"rt\") as handle:\n", + " \n", + " # iterate first layer from the XML file holding the proteome\n", + " for rec in BSqIO.parse(handle, \"uniprot-xml\"):\n", + "\n", + " # start 2nd iteration from the dbxref section that contains\n", + " # the additional info, including GO terms\n", + " for ref in rec.dbxrefs:\n", + "\n", + " # see if the iteratable item starts with GO: (in a format GO:GO)\n", + " # start from 3rd index and see if the term is in the list \n", + " if ref.startswith(\"GO:\") and ref[3:] in goterms:\n", + "\n", + " # convert sequence containing seq object to string\n", + " T = str(rec.seq)\n", + " # attach variable holding the string formatted sequence to ProteinAnalysis\n", + " X = ProteinAnalysis(T)\n", + " # count AA-frequencies and make dictionary d hold them\n", + " d = X.get_amino_acids_percent()\n", + " # keys are d[] and 
+  {
+   "cell_type": "code",
+   "execution_count": 62,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# additional run to collect those proteins that aren't on the list\n",
+    "rest = []\n",
+    "\n",
+    "with gzip.open(\"UP000005640reviewed.xml.gz\", \"rt\") as handle:\n",
+    "    for rec in BSqIO.parse(handle, \"uniprot-xml\"):\n",
+    "        # proceed from here only if the id hasn't been added previously\n",
+    "        if rec.id not in interestDF['UniprotID'].values:\n",
+    "            # empty dict to collect the DF data\n",
+    "            p = {}\n",
+    "            # prevent duplicates: add to the list only if the key isn't in the list of dictionaries\n",
+    "            # if not any(p['UniprotID'] == rec.id for p in rest):\n",
+    "            p['PE'] = prex.get(\n",
+    "                rec.annotations.get('proteinExistence')[0])\n",
+    "            p['UniprotID'] = rec.id\n",
+    "            rest.append(pd.Series(p, name=rec.name))\n",
+    "\n",
+    "restDF = pd.DataFrame(rest)\n",
+    "\n",
+    "restDF.to_parquet('restDF.parquet.gzip', compression='gzip')\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### <center> Processing data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "\n",
+    "openInterest = pd.read_parquet('allInterest.parquet.gzip')\n",
+    "openRest = pd.read_parquet('restDF.parquet.gzip')\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# extract the amino acid labels to be used later;\n",
+    "# the first 20 columns are the one-letter amino acid codes\n",
+    "AAs = openInterest.columns.values.tolist()\n",
+    "AAs = AAs[0:20]"
+   ]
+  },
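+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Toward research question 5, a minimal sketch (an addition to the excerpted analysis) tallies the re-coded Protein Existence (PE) levels in both frames; the codes 1-5 follow the prex dictionary defined above."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Distribution of the Protein Existence codes (1-5) in each frame.\n",
+    "print(openInterest['PE'].value_counts().sort_index())\n",
+    "print(openRest['PE'].value_counts().sort_index())\n"
+   ]
+  },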
protein\n", + "\n", + "countLoc = openInterest.groupby(['UniprotID']).count()[\n", + " 'Location'].reset_index(name='Size')\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "# Group the data to see how many proteins are per location\n", + "groupedInterest = openInterest.groupby('Location')\n", + "\n", + "# sort values from largest to smallest \n", + "groupedInterest.size().sort_values(ascending=False)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "# define the condtions for filtering the original DF\n", + "mimat = openInterest['Location'] == 'mitochondrial matrix'\n", + "minme = openInterest['Location'] == 'mitochondrial inner membrane'\n", + "gome = openInterest['Location'] == 'Golgi membrane'\n", + "ermem = openInterest['Location'] == 'endoplasmic reticulum membrane'\n", + "ereg = openInterest['Location'] == 'extracellular region'\n", + "cyto = openInterest['Location'] == 'cytosol'\n", + "nuc = openInterest['Location'] == 'nucleus'\n", + "\n", + "# make a new df by condition\n", + "toKeep = openInterest[mimat | minme | gome | ermem | ereg | cyto | nuc]\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# filter the proteins that have any residue over 0.4 (40%)\n", + "# columns (axis = 1) are defined by using the AA labels\n", + "nonAA = openInterest[(openInterest[AAs]>= 0.4).any(axis=1)]\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### <center> Visualizations" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "# Melt the dataframe to make it Seaborn friendly.\n", + "melted_toKeep = pd.melt(toKeep, id_vars=['Location'],\n", + " value_vars=AAs, var_name='Amino acids')\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Set the background theme such that the gridlines are visible.\n", + "sns.set_theme(style=\"whitegrid\")\n", + "\n", + "# Adjust the figure size 40 ppt in width, 10 in heigth.\n", + "plt.figure(figsize=(40, 10))\n", + "\n", + "# Make the boxplot: set labels, data from the melted data frame, group by (hue) location, \n", + "# show outliers, select the color and the width of the outlier line.\n", + "ax = sns.boxplot(x='Amino acids', y='value', data=melted_toKeep,\n", + " hue='Location', showfliers=True, color='tomato', fliersize=5)\n", + "\n", + "# Create the legend and attach it to the desired spot on the box plot. \n", + "plt.legend(bbox_to_anchor=(1.04, 0.5), loc=\"center left\", borderaxespad=0)\n", + "\n", + "# Limit the y-axis to 0.4 to make plot more scalable.\n", + "ax.set(ylim=(0, 0.4))\n", + "plt.show()\n" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "# Arange the AA's by hydrophobicity.\n", + "\n", + "# Non-polar AA labels aka. hydrophobic.\n", + "nopolab = ['A', 'I', 'L', 'M', 'F', 'W', 'Y', 'V']\n", + "\n", + "# Polar labels aka. hydrophilic.\n", + "polab = ['S', 'T', 'N', 'Q']\n", + "\n", + "# Get the nonpolar AA's and the location column from the original dataframe. 
\n", + "nonpolar = toKeep[['A', 'I', 'L', 'M', 'F', 'W', 'Y', 'V', 'Location']]\n", + "\n", + "# Get the polar AA's and the location column from the original dataframe.\n", + "polar = toKeep[['S', 'T', 'N', 'Q', 'Location']]\n", + "\n", + "melted_nonpolar = pd.melt(nonpolar, id_vars=['Location'],\n", + " value_vars= nopolab, var_name='Amino acids')\n", + "\n", + "melted_polar = pd.melt(polar, id_vars=['Location'],\n", + " value_vars=polab, var_name='Amino acids')\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# determine the size of the image\n", + "fig, axes = plt.subplots(1,2, figsize=(80, 10))\n", + "\n", + "# give name to the image\n", + "fig.suptitle('AAs by polarity')\n", + "\n", + "# make first boxplot (sub plot) according to the conditions, for visibility set Y-axis to 0.4\n", + "ax1 = sns.boxplot(ax=axes[0], x='Amino acids', y='value', data=melted_nonpolar,\n", + " hue='Location', showfliers=True, color='tomato', fliersize=5).set(\n", + " ylim=(0, 0.4), title='Non-polar AAs')\n", + "\n", + "\n", + "# determine the legend location\n", + "plt.legend(bbox_to_anchor=(1.04, 1.5), loc=\"center right\", borderaxespad=0)\n", + "\n", + "# make second boxplot (sub plot) according to the conditions, for visibility set Y-axis to 0.4\n", + "ax2 = sns.boxplot(ax=axes[1], x='Amino acids', y='value', data=melted_polar,\n", + " hue='Location', showfliers=True, color='tomato', fliersize=5).set(\n", + " ylim=(0, 0.4), title='Polar AAs')\n", + "\n", + "\n", + "# determine the legend location\n", + "plt.legend(bbox_to_anchor=(1.04, 0.5), loc=\"center left\", borderaxespad=0)\n", + "\n", + "# show the boxplots in one image\n", + "plt.show()\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# make a new df holding only the C residue \n", + "cTotest = toKeep[['Location', 'C']]\n", + "\n", + "# group the df\n", + "cgroups = cTotest.groupby('Location').count().reset_index()\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# NOTE: this isn't an optimal solution!\n", + "# it creates INDIVIDUAL plots instead of combining them to one\n", + "# this is also only for residue C\n", + "\n", + "# df where to start\n", + "unique_cs = cTotest['Location'].unique()\n", + "images = []\n", + "#empty list to gather the images\n", + "sns.set()\n", + "# start iterating the df\n", + "for x in unique_cs:\n", + " # and make a probability plot for each location\n", + " stats.probplot(cTotest[cTotest['Location'] == x]\n", + " ['C'], dist=\"norm\", plot=plt)\n", + "\n", + " # name the plot, x holds the name\n", + " plt.title(\"Probability Plot - \" + x)\n", + "# show the plot\n", + " plt.show()\n" + ] + } + ], + "metadata": { + "interpreter": { + "hash": "38740d3277777e2cd7c6c2cc9d8addf5118fdf3f82b1b39231fd12aeac8aee8b" + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.1" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} -- GitLab