Structure of Data Science Project
This article describes the 5 phases of a data science project:
- Questioning Phase:
- This is the most important phase in a data science project
- The questioning phase helps you to understand your data and decide on the type of analysis
- Running a few SQL queries can filter your data and answer some of your questions
- To extract data from bigger datasets, one can use distributed storage and processing frameworks like Apache Hadoop, Spark or Flink
- There are 6 types of questions:
- Descriptive Question: A descriptive question is asked when you need to summarise the characteristics of your data
- Exploratory Question: An exploratory question is asked to find existing patterns, trends, or relationships in your data
- Inferential Question: An inferential question cannot be answered from your dataset alone. It asks whether a pattern seen in your data also holds elsewhere, so you arrive at the answer by looking into another set of data.
- Causal Question: A causal question asks whether changing one attribute causes a change in another attribute
- Prediction Question: A predictive question is asked when your main interest is predicting a result
- Mechanistic Question: A mechanistic question asks how an action produces the desired result
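As a rough illustration of how a query can filter data to answer a descriptive question, the sketch below uses Python's built-in sqlite3 module. The `orders` table, its columns, its rows, and the helper `average_amount_by_region` are all made up for the example; a real project would query its own database.

```python
import sqlite3

# Hypothetical in-memory table; the schema and rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (region, amount) VALUES (?, ?)",
    [
        ("north", 120.0),
        ("north", 80.0),
        ("south", 200.0),
        ("south", 50.0),
        ("east", 30.0),
    ],
)


def average_amount_by_region(connection, min_amount=0.0):
    """Descriptive question: what is the average order amount per region?"""
    rows = connection.execute(
        "SELECT region, AVG(amount) FROM orders "
        "WHERE amount >= ? GROUP BY region ORDER BY region",
        (min_amount,),
    ).fetchall()
    return dict(rows)


# The WHERE clause filters the data; the GROUP BY summarises what is left.
summary = average_amount_by_region(conn, min_amount=50.0)
print(summary)
```

The filter threshold is passed as a query parameter rather than pasted into the SQL string, which keeps the same function reusable for different versions of the question.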
- Exploratory Data Analysis:
- EDA has two main goals:
- Check if the data you have is suitable to answer your questions
- Start to develop a sketch of the solution. This can be done without any formal modelling or statistical testing
- Formulating a question starts the exploratory data analysis process and limits the risk of getting sidetracked while exploring your dataset
- Next, the data should be read carefully. Most data is messy and contains irrelevant or inappropriate values, so data cleaning should be done to remove them. Sometimes already cleaned data is available
- Check if your dataset carries all the data that is required
- It is important to make sure that the data matches something outside of the dataset. External validation can be simple: just check your data against a single number.
- Plotting and visualizing the data is a good way to understand it. Plotting can occur at different stages of data analysis. It also helps you see where the data deviates from your expectations.
- The following questions can be asked to check whether your analysis is on track
- Do you have the right data?
- Do you need other data?
- Do you have the right question?
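The cleaning, external validation, and plotting steps above can be sketched with the Python standard library alone. Everything in this example is an assumption made for illustration: the `records` list, the plausible range used for cleaning, the reference value of 22.0 that stands in for a published figure, and the three helper functions.

```python
import statistics

# Hypothetical records; None marks a missing measurement.
records = [
    {"city": "A", "temp_c": 21.5},
    {"city": "B", "temp_c": None},
    {"city": "C", "temp_c": 19.0},
    {"city": "D", "temp_c": 250.0},  # implausible reading, likely an entry error
    {"city": "E", "temp_c": 23.5},
]


def clean(rows, field, low, high):
    """Drop rows whose value is missing or outside a plausible range."""
    return [r for r in rows if r[field] is not None and low <= r[field] <= high]


def matches_external_value(values, reference, tolerance):
    """External validation: compare the sample mean with one outside number."""
    return abs(statistics.mean(values) - reference) <= tolerance


def text_histogram(values, bin_width):
    """Crude text plot: one line per bin, one '#' per observation."""
    counts = {}
    for v in values:
        start = int(v // bin_width) * bin_width
        counts[start] = counts.get(start, 0) + 1
    return [
        f"{start:>4}-{start + bin_width:<4} {'#' * n}"
        for start, n in sorted(counts.items())
    ]


cleaned = clean(records, "temp_c", low=-50.0, high=60.0)
temps = [r["temp_c"] for r in cleaned]
missing = sum(r["temp_c"] is None for r in records)
print(f"{missing} missing, {len(records) - len(cleaned)} rows dropped in total")
# 22.0 is an assumed reference figure, standing in for a published average.
print("agrees with reference:", matches_external_value(temps, reference=22.0, tolerance=2.0))
print("\n".join(text_histogram(temps, bin_width=2)))
```

In practice a plotting library would replace the text histogram; the point of the sketch is only that each check is small enough to run before any formal modelling.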
- Formal Modelling
- If your sketch works out, it means you’ve got the right data
- Write down the parameters you are trying to estimate
- Reaching this stage does not guarantee that your data is right
- Challenge your results through a variety of approaches, such as sensitivity analysis
- Also make sure that your data and the algorithm you used are reproducible, because this project may later become the base for a new analysis
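A minimal sketch of the three ideas in this phase (writing down the parameters, challenging the result, and keeping the run reproducible) might look like the following. It assumes a simple straight-line model fitted by ordinary least squares and uses leave-one-out refitting as one possible form of sensitivity analysis; the synthetic data, the seed, and the function names are illustrative choices, not part of the original article.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def leave_one_out_slopes(xs, ys):
    """Sensitivity check: refit with each observation removed in turn."""
    return [
        fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])[1]
        for i in range(len(xs))
    ]


# A fixed seed makes the synthetic data, and so the whole run, reproducible.
rng = random.Random(42)
xs = [float(i) for i in range(20)]
ys = [1.0 + 2.0 * x + rng.gauss(0.0, 1.0) for x in xs]  # assumed true slope: 2.0

# The parameters being estimated are the intercept and the slope.
intercept, slope = fit_line(xs, ys)
slopes = leave_one_out_slopes(xs, ys)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
print(f"leave-one-out slope range: {min(slopes):.3f} to {max(slopes):.3f}")
```

If the slope barely moves when any single observation is dropped, the estimate does not hinge on one data point; a wide range would be a reason to look at the data again.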
- Interpretation:
- At this point, you’ve probably done many different analyses
- This phase assembles all the information you’ve gathered from your analyses
- It helps to filter the results you’ve got
- It can help to ship your code to another cluster or a self-built distributed system for tuning
- The predictive power of a model lies in its ability to generalise.
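One common way to gauge generalisation is to hold back part of the data and compare the error on it with the error on the data the model was fitted to. The sketch below assumes a k-nearest-neighbour predictor on synthetic one-dimensional data; the data, the 70/30 split, the seed, and the function names are all illustrative assumptions.

```python
import random


def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def mean_squared_error(train, points, k):
    """Mean squared error of k-NN predictions, fitted on train, over points."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in points) / len(points)


rng = random.Random(0)  # fixed seed so the split is repeatable
data = [(x / 10, 3.0 * (x / 10) + rng.gauss(0.0, 1.0)) for x in range(100)]
rng.shuffle(data)
train, test = data[:70], data[70:]

for k in (1, 10):
    train_mse = mean_squared_error(train, train, k)
    test_mse = mean_squared_error(train, test, k)
    print(f"k={k:>2}  train MSE={train_mse:.3f}  held-out MSE={test_mse:.3f}")
```

With k=1 the model simply memorises the training points, so its training error is zero, which says nothing about unseen data. The held-out error is the figure that speaks to how well the model generalises.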
- Communication Phase
- Once the data science project is successful, the findings should be communicated to an audience
- This is an essential phase because it informs the data analysis process and translates your findings into actions
- Make sure the results of your project are visualized for quick understanding
- In this phase, technical skills are not the focus. The essential skill is being able to tell a clear and actionable story
Another informal phase is the decision-making phase.