Implementing Web Crawler using Abstract Factory Design Pattern in Python
In the Abstract Factory design pattern, every product has an abstract product interface. This approach facilitates the creation of families of related objects that is independent of their factory classes. As a result, you can change the factory at runtime to get a different object – simplifies the replacement of the product families.
In this design pattern, the client uses an abstract factory interface to access objects. The abstract interface separates the creation of objects from the client, which makes the manipulation easier and isolates the concrete classes from the client. However, adding new products to the existing factory is difficult because you need to extend the factory interface, which includes changing the abstract factory interface class and all its subclasses.
Let’s look into the web crawler implementation in Python for a better understanding. As shown in the following diagram, you have an abstract factory interface class – AbstractFactory – and two concrete factory classes – HTTPConcreteFactory and FTPConcreteFactory. These two concrete classes are derived from the AbstractFactory class and have methods to create instances of three interfaces – ProtocolAbstractProduct, PortAbstractProduct, and CrawlerAbstractProduct.
Since AbstractFactory class acts as an interface for the factories such as HTTPConcreteFactory and FTPConcreteFactory, it has three abstract methods – create_protocol(), create_port(), create_crawler(). These methods are redefined in the factory classes. That means HTTPConcreteFactory class creates its family of related objects such as HTTPPort, HTTPSecurePort, and HTTPSecureProtocol, whereas, FTPConcreteFactory class creates FTPPort, FTPProtocol, and FTPCrawler.
The goal of the program is to crawl the website using the HTTP protocol or FTP protocol. Here, we need to consider three scenarios while implementing the code.
These three scenarios differ in the HTTP and FTP web access models. So, here we need to create two factories, one for creating HTTP products and another for creating FTP products – HTTPConcreteFactory and FTPConcreteFactory. These two concrete factories are derived from an abstract factory – AbstractFactory.
An abstract interface is used because the operation methods are the same for both factory classes, only the implementation is different, and hence the client code can determine which factory to using during the runtime. Let’s analyze the products created by each factory.
In the case of protocol product, HTTP concrete factory creates either http or https protocol, whereas, FTP concrete factory creates ftp protocol. For port products, HTTP concrete factory generates either 80 or 443 as a port product, and the FTP factory generates 21 as a port product. And finally, the crawler implementation differs because the website structure is different for HTTP and FTP.
Here, the created object has the same interface, whereas the created concrete objects are different for every factory. Say, for example, the port products such as HTTP port, HTTP Secure port, and FTP port have the same interface, but the concrete objects for both factories are different. The same is applicable for protocol and crawler as well.
Finally, the connector class accepts a factory and uses this factory to inject all attributes of the connector based on the factory class.
Attention geek! Strengthen your foundations with the Python Programming Foundation Course and learn the basics.
To begin with, your interview preparations Enhance your Data Structures concepts with the Python DS Course. And to begin with your Machine Learning Journey, join the Machine Learning – Basic Level Course