In the data world, the era of Big Data emerged when organizations began dealing with petabytes and exabytes of data. Until 2010, storing that much data was very difficult for industry. Now that popular frameworks such as Hadoop have solved the storage problem, the focus has shifted to processing the data, and this is where data science plays a big role. Data science is growing in many directions, so it is worth preparing for the future by learning what data science is and how we can add value to it.
Data science means different things to different people, but at its core, data science is using data to answer questions. This definition is fairly broad, and that is because data science is a fairly broad field!
Data science is the science of analyzing raw data using statistics and machine learning techniques with the purpose of drawing conclusions about that information.
Pillars of Data Science
Data scientists come from various educational and work backgrounds, but most should be proficient in, or ideally masters of, four key areas.
- Domain Knowledge
- Math Skills
- Computer Science
- Communication Skills
Many people think that domain knowledge is not important in data science, but it is essential. The foremost objective of data science is to extract useful insights from data so that they can benefit the company’s business. If you do not understand the business side of the company, how its business model works, and how you could improve it, you will be of little use to the company. You need to know how to ask the right questions of the right people so that you can obtain the information you need. On the business end, visualization tools such as Tableau help you present your results or insights in a non-technical format, such as graphs or pie charts, that business people can understand.
Math skills are very important if you are entering the data science world. If you skip this part at the beginning, you will almost certainly have to return to it in the middle of your learning, because when you apply a complex ML algorithm to build your model, you have to understand the math behind it. Cover the following topics before diving deep into data science, and consider them the most important prerequisite.
- Linear Algebra, Multivariable Calculus & Optimization Techniques: These three areas are very important because they help us understand the machine learning algorithms that play an important role in data science.
- Statistics & Probability: An understanding of statistics is essential because it is a core part of data analysis. Probability underpins statistics and is considered a prerequisite for mastering machine learning.
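To see how these math topics show up inside a machine learning algorithm, here is a minimal sketch that fits a straight line by gradient descent. It is only an illustration under stated assumptions: the function name `fit_line`, the learning rate, the step count, and the sample points are all made up for this example, and real projects would normally rely on a library instead.

```python
def fit_line(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Calculus: partial derivatives of the mean squared error
        # with respect to the slope w and the intercept b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Optimization: take a small step against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical sample points that lie exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_line(xs, ys)
print(round(w, 3), round(b, 3))
```

The loss being minimized is an average over the data, which is where statistics enters, and the gradient step is the optimization part. With more than one input feature, the same update is usually written with vectors and matrices, which is where linear algebra comes in.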
Computer science plays a major role in data science. Whether you are drawing a complex chart or implementing complex machine learning algorithms, neither is possible without a programming language such as Python or R. To handle large amounts of data, you must also know relational databases, the SQL language, and non-relational stores such as MongoDB. Here is the list of computer science knowledge you must have.
- Programming Knowledge: One needs a good grasp of programming concepts such as data structures and algorithms. The programming languages commonly used are Python, R, Java, and Scala. C++ is also useful where performance is very important.
- Relational Databases: One needs to know SQL and a relational database system such as Oracle in order to retrieve the necessary data whenever required.
- Non-Relational Databases: There are many types of non-relational databases, but the most commonly used are Cassandra, HBase, MongoDB, CouchDB, Redis, and Dynamo.
- Machine Learning: It is one of the most vital parts of data science and one of the hottest research subjects, so new advancements are made every year. One needs to understand at least the basic algorithms of supervised and unsupervised learning. Multiple libraries are available in Python and R for implementing these algorithms.
- Distributed Computing: It is also one of the most important skills for handling large amounts of data, because one can’t process that much data on a single system. The most commonly used tools are Apache Hadoop and Spark. The two major parts of Hadoop are HDFS (Hadoop Distributed File System), which stores data across a distributed file system, and MapReduce, by which we manipulate the data. One can write MapReduce programs in Java or Python. There are various other tools such as Pig and Hive.
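The map-reduce idea from the distributed computing item can be sketched on a single machine in plain Python. This is not Hadoop code and uses no Hadoop API: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count`, as well as the sample sentences, are assumptions made up for this illustration. In a real cluster the framework would run the map and reduce steps on many machines and handle the grouping in between.

```python
from collections import defaultdict


def map_phase(line):
    # Map: emit a (word, 1) pair for every word in one line of input.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    # Shuffle: group all values by key, as a framework would do
    # between the map step and the reduce step.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    # Reduce: combine all values for one key into a single result.
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical input lines standing in for a large distributed dataset.
counts = word_count(["big data needs big tools", "data science uses data"])
print(counts)
```

Because each line is mapped independently and each key is reduced independently, both steps can be spread across many machines, which is why this pattern suits data that does not fit on a single system.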
Communication skills include both written and verbal communication. In a data science project, after conclusions have been drawn from the analysis, the results have to be communicated to others. Sometimes this may be a report you send to your boss or team at work. Other times it may be a blog post. Often it may be a presentation to a group of colleagues. Regardless, a data science project always involves some form of communication of the project’s findings, so communication skills are necessary for becoming a data scientist.