Data Science Tools And Methodology
The popular languages used for data science are:
Python is a powerhouse, open-source language; more than 80 per cent of data professionals use Python.
Libraries and packages used in the data science field are:
For Scientific Computing:
Pandas - data structures and data-analysis tools
NumPy - arrays and matrices
SciPy - integrals, differential equations, optimization
For Data Visualisation:
Matplotlib - plots and graphs
Seaborn - heatmaps and time-series plots
For Machine Learning and Statistics:
Scikit-learn - machine learning: regression, classification
Statsmodels - exploring data, estimating statistical models and performing statistical tests
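As a brief illustration of how these libraries fit together, the sketch below uses pandas and NumPy (assumed to be installed) to build a small table and compute column statistics; the column names and values are invented for the example.

```python
import numpy as np
import pandas as pd

# A small NumPy array of made-up exam scores, wrapped in a DataFrame.
scores = np.array([[85, 90], [78, 82], [92, 88]])
df = pd.DataFrame(scores, columns=["math", "science"])

# Pandas computes vectorised statistics over each column.
means = df.mean()
print(df.shape)       # (rows, columns)
print(means["math"])  # average of the math column
```

From here, Matplotlib or Seaborn could plot the same DataFrame, and Scikit-learn could fit a model to it.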
R is free software known for statistical computing and graphical presentation. It integrates with other languages and offers around 15,000 packages. It is used for statistical analysis, data analysis, machine learning, and data processing and manipulation.
SQL (Structured Query Language):
It is a query language used to retrieve, insert, update and delete data in a database; it is the standard way to communicate with relational databases.
It is simple yet powerful and works with any relational database, including those storing large datasets.
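As a minimal sketch of these operations, the example below runs SQL statements against an in-memory SQLite database using Python's built-in sqlite3 module; the table and column names are invented for illustration.

```python
import sqlite3

# An in-memory SQLite database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create a table and insert rows.
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
cur.executemany("INSERT INTO users (name) VALUES (?)", [("Ada",), ("Grace",)])

# Retrieve: SELECT pulls data back out of the database.
rows = cur.execute("SELECT name FROM users ORDER BY id").fetchall()
print(rows)  # [('Ada',), ('Grace',)]

# Update and delete modify or remove existing rows.
cur.execute("UPDATE users SET name = ? WHERE name = ?", ("Ada Lovelace", "Ada"))
cur.execute("DELETE FROM users WHERE name = ?", ("Grace",))
remaining = cur.execute("SELECT name FROM users").fetchall()
print(remaining)  # [('Ada Lovelace',)]
conn.close()
```

The same SELECT, INSERT, UPDATE and DELETE statements work, with minor dialect differences, on larger databases such as PostgreSQL or MySQL.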
DATA SCIENCE METHODOLOGY: FROM DATA REQUIREMENTS TO FEEDBACK
The data that is required is decided by identifying the factors that have an impact on the outcome, guided by the chosen analytical approach.
Data is collected from databases, APIs, or any other provided source; sometimes it is gathered from the internet through web scraping, and it may be structured or unstructured. Based on the available data, the team reviews whether extra data is required, and if some data is unavailable, alternative data requirements are decided.
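To illustrate the structured case, the snippet below parses a JSON payload of the kind an API might return, using only the standard library; the field names and values are made up for the example.

```python
import json

# A made-up JSON response, as an API might return it.
payload = '{"records": [{"city": "Chennai", "temp_c": 31}, {"city": "Delhi", "temp_c": 28}]}'

# Parse the text into Python structures, then pick out one field.
data = json.loads(payload)
temps = [r["temp_c"] for r in data["records"]]
print(max(temps))  # highest temperature in the sample
```

Unstructured data (free text, images, scraped HTML) needs extra processing before it reaches this tabular, field-by-field form.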
The collected data should be understood through exploratory analysis. These steps are iterative and are repeated if necessary, to make sure the data is complete and ready to yield insights.
The model is evaluated to check whether the accuracy of its predictions is at an acceptable level; if accuracy is low, the modelling step is repeated with corrections.
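The evaluate-and-correct loop described above can be sketched as follows; the toy labels, predictions, and accuracy threshold are all invented for illustration.

```python
# Toy true labels and a model's predictions (both invented for the sketch).
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

def accuracy(truth, pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)

THRESHOLD = 0.8  # assumed acceptable accuracy level

acc = accuracy(y_true, y_pred)
if acc < THRESHOLD:
    # In a real project this branch would trigger another modelling iteration.
    print("accuracy too low, repeat modelling:", acc)
else:
    print("model accepted with accuracy", acc)
```

In practice the evaluation metric and threshold depend on the problem; precision, recall, or error measures may replace plain accuracy.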
After evaluation, the models are deployed to the stakeholders and monitored to confirm they are working properly, with continuous performance-quality checks on the deployed models.
Feedback on the model's predictions or insights is then collected; if new requirements arise or new insights need to be predicted, the modelling process is repeated until the feedback on the model is satisfactory.
- Jayavarshini J
- Apr 01, 2022