Ever Wondered About the Use of Data Science in Biotechnology?
Data Science in Biotechnology
Almost all companies, regardless of field, are now integrating data and technology. Once you have mastered data science, you have the opportunity to apply it in many different fields, including biotechnology. Hence, there is no shortage of demand for candidates with experience in math, statistics, and programming.
🤔 What is Biotechnology?
Biotechnology is technology based on biology: it harnesses cellular and biomolecular processes to develop technologies and products that help improve our health and the health of our planet.
Biological knowledge has grown. We now know a great deal about molecular interactions at the genomic level, and with predictive models we can estimate the probable outcomes of manipulating cellular processes.
There are different branches of Biotechnology:
- Medical
- Animal
- Agricultural
- Environmental
To integrate data science with Biology or any field in life, learning how to code is a must.
But where should we start?
Learning how to code can be very intimidating, especially for someone without a technical background who holds the scary assumption that the time to learn new computational skills has passed.
Learning how to code from scratch can be hard and time-consuming, which is a real obstacle for those who spend 9 or more hours a day working, especially since most of us work in a highly competitive environment.
But we must understand that if programming is not the core of your work, you don't have to be a cutting-edge programmer. As a scientist, all you need is to know how to automate repetitive tasks and make use of data. Without these skills, we won't be able to keep pace as time passes. Programming experience will make us better scientists, and the answers to our research questions will come faster and more efficiently.
We could start by learning one of the simplest programming languages to understand and work with: Python. Python's big advantage is its modules, since the modules we choose define how we are going to use it. One of the most useful modules is Pandas.
What is Pandas? 🐼
Pandas is a powerful data science library for Python, developed by Wes McKinney.
Pandas is useful in multiple phases of the data science workflow, including data cleaning, visualization, and exploratory data analysis.
Understanding Pandas isn't that easy, but we mustn't forget that practice makes perfect. Learning by doing means taking a hands-on approach. Reading about code isn't coding. Coding means doing projects, exploring the beauty of analyzing data and extracting the statistical information you need from it, getting stuck on errors, and trying to solve them. The most important thing is to be patient through the process and to keep the bigger picture in mind when the struggle is real and you catch yourself asking: what is the value of learning? Why is the learning process so slow?
The payoff will appear when you find yourself able to optimize your code and make it more readable and maintainable.
Pandas relies on a data structure called the DataFrame. It is like a spreadsheet and comes with built-in functions for manipulating and analyzing data with little programming knowledge.
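To make the spreadsheet comparison concrete, here is a minimal sketch that builds a DataFrame by hand and uses two built-in functions on it. The column names and expression values are made up for illustration only.

```python
import pandas as pd

# Hypothetical data: each key becomes a column, each list holds that column's values.
genes = pd.DataFrame(
    {
        "gene": ["BRCA1", "TP53", "EGFR"],
        "expression": [12.5, 8.1, 15.3],
    }
)

# Built-in methods work on whole columns, with no explicit loops.
mean_expression = genes["expression"].mean()
top_gene = genes.loc[genes["expression"].idxmax(), "gene"]

print(genes)
print("Mean expression:", mean_expression)
print("Highest expression:", top_gene)
```

Each column behaves like a spreadsheet column, and the row labels (0, 1, 2 here) play the role of row numbers.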
To start coding with Python and its modules, we need to install each module separately and import it in a Python IDE or Jupyter notebook. Pandas reads flat files such as CSV directly, so to import data with other file extensions we may need to import other modules.
Code snippet:
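A minimal sketch of the import step, assuming Pandas is already installed (for example with `pip install pandas`). To keep the example self-contained, the CSV text is held in memory. With a real dataset you would pass the file path to `pd.read_csv` instead. The column names and values are hypothetical.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real flat file on disk.
csv_text = """sample_id,organism,gc_content
S1,E. coli,50.8
S2,S. cerevisiae,38.3
S3,H. sapiens,40.9
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df)
```

For other formats, Pandas offers readers such as `pd.read_excel`, which relies on an additional module such as `openpyxl` being installed.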
The outcome is :
Now that our dataset is stored in a DataFrame, the next step is to explore it. One way to do this is by:
- Exploring the attributes of the columns in our dataset
- Determining the data types we are dealing with
- Exploring the first 5 rows of our dataset
- Exploring the size of our dataset
These are the lines of code needed for these aims:
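The four exploration steps above can be sketched as follows. The small DataFrame is a hypothetical stand-in for whatever dataset you have loaded.

```python
import pandas as pd

# Hypothetical dataset used only for illustration.
df = pd.DataFrame(
    {
        "sample_id": ["S1", "S2", "S3", "S4", "S5", "S6"],
        "age": [14, 19, 25, 16, 31, 18],
        "category": ["Romance", "Action", "Drama", "romance", "Action", "Comedy"],
    }
)

print(df.columns)  # names of the columns
print(df.dtypes)   # data type of each column
print(df.head())   # first 5 rows (the default for head)
print(df.shape)    # size as (rows, columns)
```

`df.info()` is another option: it summarizes column names, data types, and non-null counts in one call.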
After getting to know our dataset, we come to the cleaning stage, followed by manipulation. A few examples of ideas we could apply for this purpose:
- We could change some columns' values to the desired data type, for example from string to integer or float, especially if we are dealing with numbers.
- We could insert zero in place of N/A values.
- We could fix inconsistent spellings of values that mean the same thing. Ex: let's say we are exploring a dataset to find the most-watched category among people aged between 13 and 20. We should unify movie categories written as Romance and romance. At the end of the day they're the same, am I not right?
Then, after cleaning the dataset, we could manipulate it by slicing the data and applying statistics to reach the desired output.
Ex: on a random data set:
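A minimal sketch of these cleaning and manipulation ideas, using a made-up viewing dataset. The column names (`age`, `category`, `hours_watched`) and all values are assumptions chosen to mirror the movie-category example.

```python
import pandas as pd

# Hypothetical viewing data with the problems described above:
# numbers stored as strings, a missing value, and inconsistent spelling.
df = pd.DataFrame(
    {
        "age": ["14", "19", "25", "16", "31", "18"],
        "category": ["Romance", "Action", "Drama", "romance", "Action", "Romance"],
        "hours_watched": [2.0, None, 1.5, 3.0, 4.0, 2.5],
    }
)

# 1. Convert the age column from string to integer.
df["age"] = df["age"].astype(int)

# 2. Replace missing values with zero.
df["hours_watched"] = df["hours_watched"].fillna(0)

# 3. Unify spellings so "Romance" and "romance" count as one category.
df["category"] = df["category"].str.strip().str.capitalize()

# Slice the rows for ages 13 to 20 (between is inclusive on both ends).
teens = df[df["age"].between(13, 20)]

# Apply simple statistics: the most frequent category, and the
# total hours watched per category.
most_watched = teens["category"].value_counts().idxmax()
hours_per_category = teens.groupby("category")["hours_watched"].sum()

print(most_watched)
print(hours_per_category)
```

Whether zero is the right replacement for a missing value depends on the dataset. Here it simply follows the idea listed above.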
Now we come to the final part of our data analysis: data visualization. We could do this with the most common module for this purpose, matplotlib, or with seaborn 📈
Ex:
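A minimal matplotlib sketch, assuming we already have a count of viewers per category. The counts and the output file name are hypothetical, and the `Agg` backend is selected only so the script also runs on a machine without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is required

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical cleaned data: number of viewers per movie category.
counts = pd.Series({"Romance": 3, "Action": 2, "Drama": 1})

fig, ax = plt.subplots()
ax.bar(counts.index.tolist(), counts.values)
ax.set_xlabel("Category")
ax.set_ylabel("Number of viewers")
ax.set_title("Most-watched categories (hypothetical data)")

fig.savefig("categories.png")  # hypothetical output file name
```

In a Jupyter notebook the backend line can be dropped, and the figure is displayed inline. seaborn's `barplot` could draw a similar chart.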
The outcome would be :
Summary
We have reached our last stop. We learned that the most important factor in mastering data analysis is practice, along with trying to spot patterns in our data that we can work with. Up to this point, we have covered the basics of:
- Data importing
- Data cleaning and manipulation
- Visualizing data
References 📕
- https://towardsdatascience.com/pandas-for-biologists-f01c8a548b7c
- https://medium.com/mlearning-ai/data-science-best-practices-with-pandas-part-ii-5394fb8b7e4d
- https://www.discoverdatascience.org/industries/biotechnology/#:~:text=Data scientists%2C as the gurus, both genomic and lifestyle data.
- rwan sobhy hussein
- Mar, 28 2022