Implementing Web Crawler using Abstract Factory Design Pattern in Python

Last Updated : 11 Oct, 2021

In the Abstract Factory design pattern, every product has an abstract product interface. This approach facilitates the creation of families of related objects without the client depending on their concrete classes. As a result, you can change the factory at runtime to get a different family of objects, which simplifies replacing one product family with another.

In this design pattern, the client uses an abstract factory interface to access objects. The abstract interface separates the creation of objects from the client, which makes the manipulation easier and isolates the concrete classes from the client. However, adding new products to the existing factory is difficult because you need to extend the factory interface, which includes changing the abstract factory interface class and all its subclasses.
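
The cost of extending the interface can be seen in a small sketch. The names below (Factory, HTTPFactory, create_parser) are hypothetical and are not part of the crawler code in this article; the snippet only illustrates that once a new abstract method is added, a concrete factory that has not implemented it can no longer be instantiated.

```python
import abc


class Factory(abc.ABC):
    """Hypothetical abstract factory, for illustration only."""

    @abc.abstractmethod
    def create_port(self):
        pass

    # Newly added product: every subclass must now implement this as well.
    @abc.abstractmethod
    def create_parser(self):
        pass


class HTTPFactory(Factory):
    """Written before create_parser() existed, so it only knows about ports."""

    def create_port(self):
        return '80'


def try_build(factory_class):
    """Return an instance, or the error message if instantiation fails."""
    try:
        return factory_class()
    except TypeError as error:
        return str(error)


result = try_build(HTTPFactory)
print(result)
```

Here try_build() returns the TypeError message, which names create_parser as the missing method. Every existing concrete factory would need the same update before it could be used again.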

Let’s look into the web crawler implementation in Python for a better understanding. As shown in the following diagram, you have an abstract factory interface class – AbstractFactory – and two concrete factory classes – HTTPConcreteFactory and FTPConcreteFactory. These two concrete classes are derived from the AbstractFactory class and have methods to create instances of three interfaces – ProtocolAbstractProduct, PortAbstractProduct, and CrawlerAbstractProduct.

Since the AbstractFactory class acts as an interface for factories such as HTTPConcreteFactory and FTPConcreteFactory, it has three abstract methods – create_protocol(), create_port(), and create_crawler(). These methods are overridden in the concrete factory classes. That means the HTTPConcreteFactory class creates its own family of related objects, such as HTTPPort, HTTPSecurePort, HTTPProtocol, HTTPSecureProtocol, and HTTPCrawler, whereas the FTPConcreteFactory class creates FTPPort, FTPProtocol, and FTPCrawler.

Python3

import abc
import urllib.error
import urllib.request
from bs4 import BeautifulSoup
 
class AbstractFactory(object, metaclass=abc.ABCMeta):
    """ Abstract Factory Interface """
     
    def __init__(self, is_secure):
        self.is_secure = is_secure
 
    @abc.abstractmethod
    def create_protocol(self):
        pass
 
    @abc.abstractmethod
    def create_port(self):
        pass
 
    @abc.abstractmethod
    def create_crawler(self):
        pass
 
class HTTPConcreteFactory(AbstractFactory):
    """ Concrete Factory for building HTTP connection. """
     
    def create_protocol(self):
        if self.is_secure:
            return HTTPSecureProtocol()
        return HTTPProtocol()
 
    def create_port(self):
        if self.is_secure:
            return HTTPSecurePort()
        return HTTPPort()
 
    def create_crawler(self):
        return HTTPCrawler()
 
class FTPConcreteFactory(AbstractFactory):
    """ Concrete Factory for building FTP connection """
     
    def create_protocol(self):
        return FTPProtocol()
 
    def create_port(self):
        return FTPPort()
 
    def create_crawler(self):
        return FTPCrawler()
 
class ProtocolAbstractProduct(object, metaclass=abc.ABCMeta):
    """ An abstract product, represents protocol to connect """
     
    @abc.abstractmethod
    def __str__(self):
        pass
     
class HTTPProtocol(ProtocolAbstractProduct):
    """ A concrete product, represents http protocol """
     
    def __str__(self):
        return 'http'
 
class HTTPSecureProtocol(ProtocolAbstractProduct):
    """ A concrete product, represents https protocol """
     
    def __str__(self):
        return 'https'
 
class FTPProtocol(ProtocolAbstractProduct):
    """ A concrete product, represents ftp protocol """
     
    def __str__(self):
        return 'ftp'
 
class PortAbstractProduct(object, metaclass=abc.ABCMeta):
    """ An abstract product, represents port to connect """
     
    @abc.abstractmethod
    def __str__(self):
        pass
 
class HTTPPort(PortAbstractProduct):
    """ A concrete product which represents http port. """
     
    def __str__(self):
        return '80'
 
class HTTPSecurePort(PortAbstractProduct):
    """ A concrete product which represents https port """
    def __str__(self):
        return '443'
 
class FTPPort(PortAbstractProduct):
    """ A concrete product which represents ftp port. """
     
    def __str__(self):
        return '21'
 
class CrawlerAbstractProduct(object, metaclass=abc.ABCMeta):
    """ An Abstract product, represents parser to parse web content """
     
    @abc.abstractmethod
    def __call__(self, content):
        pass
 
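class HTTPCrawler(CrawlerAbstractProduct):
    """ A concrete product, parses an HTML page """

    def __call__(self, content):
        """ Parses web content and returns the link targets, one per line """

        soup = BeautifulSoup(content, "html.parser")
        # Look for links inside the first table; fall back to the whole
        # page when the document has no <table> element.
        container = soup.table if soup.table is not None else soup
        links = container.find_all('a', href=True)

        return '\n'.join(link['href'] for link in links)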
 
class FTPCrawler(CrawlerAbstractProduct):
    """ A concrete product, parses an FTP directory listing """

    def __call__(self, content):
        """ Parses FTP directory listing """
        content = str(content, 'utf-8')
        filenames = []

        for line in content.split('\n'):
            # A Unix-style listing line has eight metadata columns
            # followed by the file name, which may itself contain spaces.
            columns = line.split(None, 8)
            if len(columns) == 9:
                filenames.append(columns[-1])

        return '\n'.join(filenames)
 
class Connector(object):
    """ A client """
     
    def __init__(self, abstractfactory):
        """ Creates all attributes of the connector
        from the given abstract factory. """
         
        self.protocol = abstractfactory.create_protocol()
        self.port = abstractfactory.create_port()
        self.crawl = abstractfactory.create_crawler()
 
    def read(self, host, path):
        url = str(self.protocol) + '://' + host + ':' + str(self.port) + path
        print('Connecting to', url)
        return urllib.request.urlopen(url, timeout=10).read()
 
if __name__ == "__main__":
    con_domain = 'ftp.freebsd.org'
    con_path = '/pub/FreeBSD/'
 
    con_protocol = input('Choose the protocol (0-http, 1-ftp): ')

    if con_protocol == '0':
        # Only HTTP has a secure variant in this example.
        is_secure = input('Use secure connection? (1-yes, 0-no): ') == '1'
        abstractfactory = HTTPConcreteFactory(is_secure)
    else:
        is_secure = False
        abstractfactory = FTPConcreteFactory(is_secure)
 
    connector = Connector(abstractfactory)
 
    try:
        data = connector.read(con_domain, con_path)
    except urllib.error.URLError as e:
        print('Cannot access resource with this method', e)
    else:
        print(connector.crawl(data))

Output

The goal of the program is to crawl a website using either the HTTP or the FTP protocol. Here, we need to consider three products while implementing the code.

Protocol
Port
Crawler

These three products differ between the HTTP and FTP access models. So, we create two factories, one for creating HTTP products and another for creating FTP products – HTTPConcreteFactory and FTPConcreteFactory. Both concrete factories are derived from an abstract factory – AbstractFactory.

An abstract interface is used because the operations are the same for both factory classes and only the implementation differs; hence the client code can determine which factory to use at runtime. Let’s analyze the products created by each factory.
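
Choosing the factory at runtime can be sketched with a lookup table. This is a trimmed-down illustration, not the article's full listing: the factories here create only a port, and the FACTORIES registry, describe(), and run() are hypothetical names introduced for the example.

```python
import abc


class AbstractFactory(abc.ABC):
    """Trimmed-down factory interface with a single product."""

    @abc.abstractmethod
    def create_port(self):
        pass


class HTTPFactory(AbstractFactory):
    def create_port(self):
        return '80'


class FTPFactory(AbstractFactory):
    def create_port(self):
        return '21'


# Hypothetical registry: maps a user choice to a factory class.
FACTORIES = {'http': HTTPFactory, 'ftp': FTPFactory}


def describe(factory):
    """Client code: depends only on the AbstractFactory interface."""
    return 'port ' + factory.create_port()


def run(choice):
    """Pick the factory at runtime, then hand it to the client code."""
    factory = FACTORIES[choice]()
    return describe(factory)


print(run('http'))  # port 80
print(run('ftp'))   # port 21
```

The describe() function never names a concrete factory, so supporting another protocol would only require one more entry in the registry.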

In the case of the protocol product, the HTTP concrete factory creates either the http or the https protocol, whereas the FTP concrete factory creates the ftp protocol. For port products, the HTTP concrete factory generates either 80 or 443, and the FTP factory generates 21. Finally, the crawler implementation differs because the content returned over HTTP and FTP is structured differently.
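
To see why the FTP crawler cannot reuse the HTML parsing logic, consider a plain-text directory listing. The sketch below assumes the server returns a Unix-style listing; SAMPLE_LISTING is made-up data and parse_ftp_listing() is a hypothetical standalone function that mirrors the column-splitting approach of FTPCrawler.

```python
# Hypothetical sample of a Unix-style FTP directory listing (not real output).
SAMPLE_LISTING = (
    b"drwxr-xr-x    3 ftp      ftp             5 Jan 01  2020 releases\n"
    b"-rw-r--r--    1 ftp      ftp          4259 Jan 01  2020 README.TXT\n"
    b"total 2\n"
)


def parse_ftp_listing(content):
    """Return the file names from a raw FTP listing, one per line."""
    filenames = []
    for line in content.decode('utf-8').split('\n'):
        # Permissions, link count, owner, group, size, month, day and
        # year/time take eight columns; the ninth is the file name.
        columns = line.split(None, 8)
        if len(columns) == 9:
            filenames.append(columns[-1])
    return '\n'.join(filenames)


names = parse_ftp_listing(SAMPLE_LISTING)
print(names)
```

Lines with fewer than nine columns, such as the "total 2" line in the sample, are skipped. There are no tags or links to look for, which is why the HTTP crawler relies on an HTML parser instead.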

Here, the created products share the same interface, whereas the concrete objects differ from factory to factory. For example, the port products HTTP port, HTTP Secure port, and FTP port have the same interface, but the concrete objects created by the two factories are different. The same applies to the protocol and crawler products.

Finally, the Connector class accepts a factory and uses it to create the protocol, port, and crawler attributes of the connector.
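
Because the connector receives its factory from outside, a different factory can be passed in for offline experiments. The sketch below is an assumption-laden illustration: Connector is a trimmed copy of the article's client without the network call, build_url() is a hypothetical helper that repeats the URL format used in read(), and StubFactory is a made-up factory that relies on duck typing instead of subclassing AbstractFactory.

```python
class Connector:
    """Trimmed copy of the article's client; the network call is left out."""

    def __init__(self, abstractfactory):
        self.protocol = abstractfactory.create_protocol()
        self.port = abstractfactory.create_port()
        self.crawl = abstractfactory.create_crawler()

    def build_url(self, host, path):
        # Hypothetical helper: same URL format as read() in the article.
        return str(self.protocol) + '://' + host + ':' + str(self.port) + path


class StubFactory:
    """Hypothetical factory for offline experiments; products are plain values."""

    def create_protocol(self):
        return 'http'

    def create_port(self):
        return '8080'

    def create_crawler(self):
        # Any callable taking the raw content works as a crawler.
        return lambda content: content.upper()


connector = Connector(StubFactory())
url = connector.build_url('localhost', '/files/')
print(url)                       # http://localhost:8080/files/
print(connector.crawl('index'))  # INDEX
```

The connector code is identical for the stub and for the real factories; only the object handed to the constructor changes.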
