What is Data Science?

Data Science is an interdisciplinary field that focuses on extracting knowledge from data sets, which are typically large. The field encompasses preparing data for analysis, performing the analysis, and presenting findings to inform high-level decisions in an organization. As such, it incorporates skills from computer science, mathematics, statistics, information visualization, graphic design, and business.

Solving the problem

Data is everywhere and is one of the most important assets of any organization: it helps a business flourish by enabling decisions based on facts, statistics, and trends. This growing scope of data gave rise to data science, a multidisciplinary IT field, and data scientists now hold some of the most in-demand jobs of the 21st century. Data science, with data analysis at its core, helps us discover useful information in data, answer questions, and even predict the future or the unknown. It uses scientific approaches, procedures, algorithms, and frameworks to extract knowledge and insight from huge amounts of data.

Data science brings together ideas, data examination, machine learning, and their related strategies to comprehend and dissect real-world phenomena with data. It extends data analysis fields such as data mining, statistics, and predictive analytics. It is a broad field that draws on methods and concepts from information science, statistics, mathematics, and computer science. Techniques used in data science include machine learning, visualization, pattern recognition, probability modeling, data engineering, and signal processing.

A few important steps to help you work more successfully on data science projects:

  • Setting the research goal: Understanding the business or activity that our data science project is part of is key to its success, and it is the first phase of any sound data analytics project. The foremost task is defining the what, the why, and the how of the project in a project charter. We then sit down to define a timeline and concrete key performance indicators; this is the essential first step to kick-start our data initiative.
  • Retrieving data: Finding and getting access to the data needed for the project is the next step. Mixing and merging data from as many sources as possible is what makes a data project great, so look as far as possible. The data is either found within the company or retrieved from a third party. A few ways to get usable data: connecting to a database, using APIs, or looking for open data.
  • Data preparation: The next step is the dreaded data preparation process, which typically takes up to 80% of the time dedicated to a data project. It involves checking and remediating data errors, enriching the data with data from other sources, and transforming it into a format suitable for your models.
  • Data exploration: Once the data is clean, it is time to manipulate it to get the most value out of it. We explore the data by diving deeper into it with descriptive statistics and visual techniques. One example is enriching the data with time-based features, such as extracting date components (month, hour, day of the week, week of the year, etc.), calculating differences between date columns, or flagging national holidays. Another way of enriching data is by joining datasets: retrieving columns from one dataset into a reference dataset.
  • Data modeling: Using machine learning and statistical techniques to further the project goal and predict future trends. By working with clustering algorithms, for instance, we can build models that uncover trends in the data that were not distinguishable in graphs and summary statistics. These create groups of similar events (clusters) and more or less explicitly express which features are decisive in those results.
  • Presentation and automation: Presenting the results to stakeholders and industrializing the analysis process for repeated reuse and integration with other tools. When dealing with large volumes of data, visualization is the best way to explore and communicate findings, and it is the final phase of a data analytics project.
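The time-based feature enrichment described in the data-exploration step can be sketched with pandas. The dates and the sales log below are hypothetical, invented purely for illustration:

```python
import pandas as pd

# Hypothetical sales log with a date column
df = pd.DataFrame({
    "order_date": pd.to_datetime(["2023-01-06", "2023-03-15", "2023-12-25"]),
    "amount": [120.0, 80.5, 200.0],
})

# Extract date components as new features
df["month"] = df["order_date"].dt.month
df["day_of_week"] = df["order_date"].dt.day_name()
df["week_of_year"] = df["order_date"].dt.isocalendar().week

print(df)
```

Differences between date columns (e.g. order date vs. ship date) and holiday flags can be derived the same way with datetime arithmetic.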

Why Data Scientist?

Data scientists straddle the worlds of business and IT and possess unique skill sets. Their role has assumed significance thanks to how businesses today think of big data. Businesses want to make use of unstructured data that can boost their revenue. Data scientists analyze this information to make sense of it and draw out business insights that aid the growth of the business.



Python Packages for Data Science

Now, let’s get started with the foremost topic, Python Packages for Data Science, which will be the stepping stone of our Data Science journey. A Python library is a collection of functions and methods that lets us perform many actions without having to write the code from scratch.

1. Scientific Computing Libraries:

  • Pandas: It offers data structures and tools for effective data manipulation and analysis. Its central object, the DataFrame, is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes, and it provides fast access to structured data.

    Example:


    import pandas as pd

    lst = ['I', 'Love', 'Data', 'Science']
    df = pd.DataFrame(lst)

    print(df)


    Output:

             0
    0        I
    1     Love
    2     Data
    3  Science
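The labeled axes mentioned above are easier to see with named columns and index labels. A minimal sketch, using an invented table of programming languages:

```python
import pandas as pd

# A small table with labeled columns and a labeled index
df = pd.DataFrame(
    {"language": ["Python", "R", "Julia"], "year": [1991, 1993, 2012]},
    index=["py", "r", "jl"],
)

# Label-based access: row by index label, column by name
print(df.loc["py", "language"])  # a single cell
print(df["year"].max())          # a whole-column operation
```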

  • NumPy: It provides a fast N-dimensional array object for its inputs and outputs, which can be extended to matrix objects. It allows developers to perform fast array processing with minimal code.

    Example:


    import numpy as np

    arr = np.array([[1, 2, 3], [4, 6, 8]])

    print("Array is of type: ", type(arr))
    print("No. of dimensions:", arr.ndim)
    print("Shape of array: ", arr.shape)


    Output:

    Array is of type:  <class 'numpy.ndarray'>
    No. of dimensions: 2
    Shape of array:  (2, 3)
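The fast array processing mentioned above comes from vectorized operations: arithmetic and reductions run in compiled code with no Python loop. A short sketch, reusing the same array:

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 6, 8]])

# Element-wise arithmetic, applied to every entry at once
doubled = a * 2

# Reductions along an axis
col_sums = a.sum(axis=0)  # sum down each column
row_max = a.max(axis=1)   # max across each row

print(doubled)   # [[ 2  4  6] [ 8 12 16]]
print(col_sums)  # [ 5  8 11]
print(row_max)   # [3 8]
```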
  • SciPy: It is an open-source Python-based library with functions for advanced math problems such as integrals, differential equations, and optimization, along with utilities for data visualization. It is easy to use and understand and offers fast computation.

    Example:




    import numpy as np
    from scipy import misc  # in SciPy >= 1.10, use scipy.datasets.face() instead
    import matplotlib.pyplot as plt

    print("I like ", np.pi)
    face = misc.face()  # built-in sample raccoon-face image
    plt.imshow(face)
    plt.show()


    Output:

    I like  3.141592653589793
    (A window showing the sample raccoon-face image also opens.)
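The integrals and optimization mentioned in the SciPy description can be shown more directly than with the image demo above. A minimal sketch using `scipy.integrate.quad` and `scipy.optimize.minimize_scalar`:

```python
import numpy as np
from scipy import integrate, optimize

# Definite integral of sin(x) from 0 to pi (exact value: 2)
area, err = integrate.quad(np.sin, 0, np.pi)
print(round(area, 6))

# Numerically find the minimum of (x - 3)^2 (exact minimizer: x = 3)
result = optimize.minimize_scalar(lambda x: (x - 3) ** 2)
print(round(result.x, 6))
```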

2. Visualization Libraries:

  • Matplotlib: It provides an object-oriented API for embedding plots into applications. Each pyplot function makes some change to a figure: it can create a figure, create a plotting area within a figure, plot lines in a plotting area, and so on.

    Example:


    import matplotlib.pyplot as plt

    plt.plot([1, 2, 3, 4])
    plt.ylabel('some numbers')

    plt.show()


    Output:

    (A line plot of the values 1 to 4 is displayed, with the y-axis labeled "some numbers".)

  • Seaborn: It is used for making statistical graphics. It provides a high-level interface for drawing attractive and informative plots, and makes it very easy to generate various plot types such as heat maps, time series, and violin plots.

    Example:


    import seaborn as sns

    sns.set()
    tips = sns.load_dataset("tips")
    sns.relplot(x="total_bill",
                y="tip",
                col="time",
                hue="smoker",
                style="smoker",
                size="size",
                data=tips)


    Output:

    (Two scatter plots of tip vs. total bill, one per mealtime, with smoker status encoded by color and marker style and party size by marker size.)

3. Algorithmic Libraries:

  • Scikit-learn: It is a free machine-learning library for Python that provides statistical modeling, including regression, classification, and clustering. It uses NumPy for high-performance linear algebra and array operations.

    Example:


    from sklearn import datasets

    iris = datasets.load_iris()
    digits = datasets.load_digits()
    print(digits)


    Output (abbreviated):

    {'data': array([[ 0., 0., 5., ..., 0., 0., 0.],
    [ 0., 0., 0., ..., 10., 0., 0.],
    ...,
    [ 0., 0., 10., ..., 12., 1., 0.]]),
    'target': array([0, 1, 2, ..., 8, 9, 8]),
    'target_names': array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
    'images': array([...]),
    'DESCR': '.. _digits_dataset:\n\nOptical recognition of handwritten digits dataset\n...'}
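Beyond loading datasets, the classification mentioned in the scikit-learn description fits in a few lines. A minimal sketch with a logistic-regression classifier on the iris dataset (model choice and split parameters are illustrative, not prescribed by the library):

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load the iris dataset and hold out a quarter of it for testing
iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0
)

# Fit a classifier and evaluate it on the held-out data
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

print("Test accuracy:", clf.score(X_test, y_test))
```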

  • Statsmodels: It is built on NumPy and SciPy and allows users to explore data, estimate statistical models, and perform statistical tests. It also uses pandas for data handling and Patsy for an R-like formula interface.

    Example:


    import numpy as np
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    dat = sm.datasets.get_rdataset("Guerry", "HistData").data
    results = smf.ols('Lottery ~ Literacy + np.log(Pop1831)',
                      data=dat).fit()

    print(results.summary())


    Output:

    (An OLS regression results summary table is printed, with coefficient estimates, standard errors, and fit statistics.)



