SourceWolf – A CLI Web Crawler Tool in Linux

Last Updated : 14 Sep, 2021

Web crawling is the process of indexing data on web pages using a program or automated script. Such programs go by several names, including web crawler, spider, and spider bot, often shortened to crawler. Manual crawling takes a lot of time when the scope of the target is large. SourceWolf is an automated script written in Python that crawls the directories of a domain and reports the status code returned for each one. This helps the tester quickly pick out the pages that respond with 200 or 301. SourceWolf is free and open source, and it supports custom word lists for brute-forcing. Its output is well organized: results are stored in a main output directory, which contains a separate sub-directory for each status code.

What can SourceWolf do?

SourceWolf can crawl responses to identify hidden endpoints of the target domain.
SourceWolf can create a verbose list of the JavaScript variables identified in the sources.
SourceWolf supports brute-forcing files and directories using a custom word list.
SourceWolf can display the status code of each directory visited on the target server.
SourceWolf provides an option to crawl response files locally, so that you aren’t sending requests again to an endpoint whose response you already have a copy of.

3 modes of SourceWolf

1. Crawl response mode: In this mode, hidden endpoints are discovered and saved in text file format.

python3 sourcewolf.py -l domains.txt -o output/ -c crawl_output

2. Brute force mode: In this mode, a brute-force attack is run to detect files and directories on the target domain. A word list supplies the candidate names.

python3 sourcewolf.py -b https://geeksforgeeks.org/FUZZ -w /usr/share/wordlists/dirb/common.txt -s status

3. Probing mode: In this mode, the tool verifies whether each target host is live.

python3 sourcewolf.py -l domains.txt -s live
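
To make the role of the FUZZ placeholder in brute force mode concrete, here is a minimal Python sketch of the general idea: each entry of a word list is substituted into the URL template to produce one candidate URL. This is an illustration only, not SourceWolf's actual code; the function name expand_targets and the sample words are assumptions made for the example.

```python
from typing import Iterable, List

PLACEHOLDER = "FUZZ"  # marker that appears in the URL template


def expand_targets(template: str, words: Iterable[str]) -> List[str]:
    """Return one candidate URL per non-empty word (hypothetical helper)."""
    if PLACEHOLDER not in template:
        raise ValueError(f"template must contain {PLACEHOLDER}")
    urls = []
    for word in words:
        word = word.strip()
        if not word:
            continue  # skip blank lines in the word list
        urls.append(template.replace(PLACEHOLDER, word))
    return urls


if __name__ == "__main__":
    sample_words = ["admin", "", "login", "about"]
    for url in expand_targets("https://geeksforgeeks.org/FUZZ", sample_words):
        print(url)
```

A real brute-force run would then request each candidate URL and record the status code; that part is left out here so the sketch makes no network calls.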

How can this be integrated into your workflow?

SourceWolf fits well into a reconnaissance workflow because it can filter out live domains. We can enumerate subdomains with tools such as Amass, AssetFinder, and Sublist3r and pass the resulting list to SourceWolf. SourceWolf keeps only the responsive (live) subdomains, so we test those instead of wasting time on inactive ones. The tool is also useful for finding the endpoints of the target domain.
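
As a rough sketch of the step between enumeration and probing, the following Python snippet merges the outputs of several enumeration tools into one de-duplicated list that could be saved as domains.txt. The helper name merge_subdomains and the example.com hosts are assumptions for illustration; the enumeration tools themselves are not invoked here.

```python
from typing import Iterable, List


def merge_subdomains(*sources: Iterable[str]) -> List[str]:
    """Combine several enumeration outputs into one sorted, de-duplicated list."""
    seen = set()
    for source in sources:
        for line in source:
            # Normalize case and drop surrounding whitespace and a trailing dot.
            host = line.strip().lower().rstrip(".")
            if host:
                seen.add(host)
    return sorted(seen)


if __name__ == "__main__":
    amass_out = ["www.example.com", "api.example.com"]
    assetfinder_out = ["API.example.com", "dev.example.com", ""]
    print("\n".join(merge_subdomains(amass_out, assetfinder_out)))
```

The merged list can be written to domains.txt and passed to the probing command shown earlier (python3 sourcewolf.py -l domains.txt -s live).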

Naming conventions

To crawl the files locally, we must follow some naming conventions. These conventions are in place so that SourceWolf can directly identify the hostname and thereby parse all the endpoints, including the relative ones.

Consider the URL https://geeksforgeeks.org/api/

Remove the protocol (https://) and the trailing slash (/), if any, from the URL –> geeksforgeeks.org/api
Replace ‘/’ with ‘@’ –> geeksforgeeks.org@api
Save the response as a text file with the file name obtained above.

So the file is finally named geeksforgeeks.org@api.txt
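
The naming steps can be sketched as a small Python function. Applied literally, they keep the full hostname (only ‘/’ is replaced), so the example URL maps to geeksforgeeks.org@api.txt. The function name response_filename is hypothetical and is not part of SourceWolf.

```python
def response_filename(url: str) -> str:
    """Build the local file name for a saved response (hypothetical helper)."""
    # Step 1: drop the protocol, if one is present.
    _, sep, rest = url.partition("://")
    target = rest if sep else url
    # Step 1 (continued): drop any trailing slash.
    target = target.rstrip("/")
    # Step 2: replace '/' with '@'; Step 3: save as a text file.
    return target.replace("/", "@") + ".txt"


if __name__ == "__main__":
    print(response_filename("https://geeksforgeeks.org/api/"))
```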

Note: Make sure Python is installed on your system, as this is a Python-based tool. For the installation process, see: Python Installation Steps on Linux

Installation of SourceWolf Tool on Kali Linux OS

Step 1: Check whether the Python environment is set up by using the following command.

python3

Step 2: Open up your Kali Linux terminal and move to Desktop using the following command.

cd Desktop

Step 3: You are now on the Desktop. Create a new directory called SourceWolf using the following command. We will complete the installation of the SourceWolf tool in this directory.

mkdir SourceWolf

Step 4: Now switch to the SourceWolf directory using the following command.

cd SourceWolf

Step 5: Now install the tool by cloning it from GitHub.

git clone https://github.com/micha3lb3n/SourceWolf.git

Step 6: The tool has been downloaded into the SourceWolf directory. List the contents of the directory using the command below.

ls

Step 7: You can see that cloning created a new directory, also named SourceWolf. Move into that directory using the command below:

cd SourceWolf

Step 8: List the contents of the tool's directory once again using the command below.

ls

Step 9: Install the packages required to run the tool using the following command.

pip3 install -r requirements.txt

Step 10: The installation is now complete. Use the command below to view the tool's help index, which gives a better understanding of its options.

python3 sourcewolf.py -h

Working with SourceWolf Tool on Kali Linux OS

Example 1: Simple Usage

python3 sourcewolf.py --url http://geeksforgeeks.org/wp-admin

In this example, we are testing a single path on the target domain geeksforgeeks.org. The server returned status code 500, which indicates a generic server-side error.
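
When reading results like this, it can help to map numeric status codes to their standard reason phrases. The sketch below uses Python's standard http.HTTPStatus enum; the helper name describe_status is an assumption for illustration and is unrelated to SourceWolf's own output format.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return the code with its standard reason phrase, if it has one."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # Codes outside the standard registry have no enum member.
        return f"{code} (non-standard status code)"
    return f"{code} {status.phrase}"


if __name__ == "__main__":
    for code in (200, 301, 500):
        print(describe_status(code))
```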

Example 2: Brute force

python3 sourcewolf.py -b http://geeksforgeeks.org/FUZZ

1. In this example, we brute-force directories on the geeksforgeeks.org domain. Since no word list is specified with -w, the tool uses its default word list.

2. The screenshot below shows the results along with the status codes returned by the server.

Example 3: Verbose

python3 sourcewolf.py -b http://geeksforgeeks.org/FUZZ -v

1. In this example, we print the results in more detail. We have used the -v flag for verbose mode.

2. In the screenshot below, the results appear in real time, and every directory tested is shown in the terminal with the status code returned by the server.

Example 4: Wordlist

python3 sourcewolf.py -b http://geeksforgeeks.org/FUZZ -w /usr/share/wordlists/dirb/common.txt

1. In this example, we use a custom word list, specified with the -w flag.

2. The screenshot below shows the command that specifies the custom word list.

3. The screenshot below shows the results of our fuzzing. We then open the URL http://geeksforgeeks.org/About, which returned status code 200 (OK).

4. The screenshot below shows the About page opened in the web browser.

Example 5: Output

python3 sourcewolf.py -b http://geeksforgeeks.org/FUZZ -w /usr/share/wordlists/dirb/common.txt -o ok

1. In this example, we save the results to disk for further use. We use the -o flag along with the name of the directory where the results will be saved.

2. The screenshot below shows the results of our scan.

3. In the screenshot below, new directories named after the status codes have been created. Each directory stores the web page information for responses with that status code.
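
The idea of one directory per status code can be sketched in Python as follows. This is only an illustration of the grouping concept: the function names and the urls.txt file name are assumptions, and the files SourceWolf actually writes may be laid out differently.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List, Tuple


def group_by_status(results: Iterable[Tuple[str, int]]) -> Dict[int, List[str]]:
    """Group (url, status_code) pairs by status code."""
    grouped = defaultdict(list)
    for url, status in results:
        grouped[status].append(url)
    return dict(grouped)


def write_grouped(results: Iterable[Tuple[str, int]], out_dir: str) -> None:
    """Write one sub-directory per status code (hypothetical layout)."""
    for status, urls in group_by_status(results).items():
        status_dir = Path(out_dir) / str(status)
        status_dir.mkdir(parents=True, exist_ok=True)
        # urls.txt is an illustrative file name, not SourceWolf's.
        (status_dir / "urls.txt").write_text("\n".join(urls) + "\n")


if __name__ == "__main__":
    sample = [
        ("http://geeksforgeeks.org/About", 200),
        ("http://geeksforgeeks.org/wp-admin", 500),
    ]
    print(group_by_status(sample))
```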