Introduction to Data Science: Skills Required

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge or insights from data in various forms, structured or unstructured. In that respect it is similar to data mining.

Big Data Analytics, or Data Science, is a very common term in the IT industry. It is widely understood as the discipline that helps us deal with the huge amount of data we generate these days.

Let’s look at the skills required:


  1. Math Skills:
    • Multivariable Calculus & Linear Algebra: These two are very important because they help us understand the machine learning algorithms that play an important role in data science.
    • Probability & Statistics: An understanding of statistics is very important because it is the branch of mathematics that underlies data analysis. Probability theory is in turn the foundation of statistics, and it is commonly listed as a prerequisite for learning machine learning.
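
To make the link between these math skills and machine learning concrete, here is a minimal sketch of fitting a straight line by ordinary least squares using only the Python standard library. The function name `fit_line` and the sample data are assumptions made up for illustration; they do not come from any particular library or dataset.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of cross-deviations (proportional to the covariance of x and y)
    # and sum of squared deviations of x (proportional to its variance).
    # The common 1/n factor cancels in the ratio below.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical data: hours studied vs. exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 61, 63, 70]

slope, intercept = fit_line(hours, scores)
print(f"score = {slope:.2f} * hours + {intercept:.2f}")
# prints: score = 4.50 * hours + 46.50
```

The slope formula comes from setting the derivative of the squared error to zero (calculus), and the quantities involved are the sample covariance and variance (statistics). The same idea generalizes to many input variables through linear algebra.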
  2. Programming Skills:
    • Programming Knowledge: You need a good grasp of programming concepts such as data structures and algorithms. Commonly used languages are Python, R, Java and Scala. C++ is also used where performance is extremely important.
    • Relational Databases: You need to know SQL, the query language of relational databases such as Oracle, so that you can fetch the required data from them whenever needed.
    • Non-Relational Databases: There are many types, but the most commonly used are:
      i) Column: Cassandra, HBase
      ii) Document: MongoDB, CouchDB
      iii) Key-value: Redis, Dynamo
    • Distributed Computing: This is one of the most important skills for handling large amounts of data, because such volumes cannot be processed on a single system. The main tools are Apache Hadoop and Spark. Hadoop has two main parts: HDFS (Hadoop Distributed File System), which stores data across the machines of a cluster, and MapReduce, which processes that data. MapReduce programs can be written in Java or Python. There are many other tools as well, such as Pig and Hive.
    • Machine Learning: This is one of the most important parts of data science and one of the most active research topics, so new developments appear every year. At a minimum, you need to know the common supervised and unsupervised learning algorithms. Many libraries are available in Python and R.

      List of Python Libraries:
      i) Basic Libraries: NumPy, SciPy, Pandas, IPython, Matplotlib
      ii) Libraries for Machine Learning: scikit-learn, Theano, TensorFlow
      iii) Libraries for Data Mining & Natural Language Processing: Scrapy, NLTK, Pattern
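
The map-reduce model mentioned under Distributed Computing can be sketched on a single machine in plain Python. This is only an illustration of the idea, assuming a word-count task; the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are hypothetical and are not part of the Hadoop API. On a real cluster the framework runs the map and reduce steps on many machines and performs the grouping between them.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)

def word_count(lines):
    # Run the three stages in sequence on one machine.
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

# Hypothetical input: each string stands for one line of a large file.
lines = ["big data needs big tools", "data science needs data"]
counts = word_count(lines)
print(counts)
```

Because each call to `map_phase` looks at only one line and each call to `reduce_phase` looks at only one key, both stages can be spread across many machines, which is what makes the model suitable for data that does not fit on a single system.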

  3. Domain Knowledge
    Most people ignore this, thinking it is not important, but it is very important. The whole purpose of data science is to extract useful insights from data so that they can benefit the company’s business. If you don’t understand the business side of your company, that is, how its business model works and how you can make it better, then you are of little use to the company. You need to know how to ask the right questions of the right people so that you get the information you need. Visualization tools such as Tableau are used on the business end; they help you present your results in a nontechnical format, such as graphs or pie charts, that business people can understand.



