Skip to content

Restructure Notes

Christopher Kent Hoadley edited this page Dec 25, 2019 · 5 revisions

Purpose

This is some notes about the long overdue restructuring of Sherlock.

Goals

  1. Make Sherlock work as a general package.

  2. Allow extensibility for other types of detection mechanisms.
    Many websites now require Javascript to be enabled before one can query about support for a given username. We need a line of site to achieve this.

  3. Allow Core Functionality To Be Other Than As A Command Line Tool
    While Sherlock has started out as a command line tool, it would also be useful to allow it to be used as an API or as a module for a web site.

Detail

Print Statements

Sherlock has bunches of areas where print statements are embedded directly into the functional aspects of the code. This needs to be removed. Some of these print statements are for troubleshooting (these should be integrated into the logging module). Others are for printing the results (these need to be deferred to a higher level module whose sole responsibility is printing the results).

Dependencies

Here are the current dependencies for Sherlock:

beautifulsoup4>=4.8.0
bs4>=0.0.1
certifi>=2019.6.16
colorama>=0.4.1
lxml>=4.4.0
PySocks>=1.7.0
requests>=2.22.0
requests-futures>=1.0.0
soupsieve>=1.9.2
stem>=1.7.1
torrequest>=0.1.0
Can we get rid of any of these? I have lost count of how many times people have complained that Sherlock is not working, and it is just that they do not have Requests installed.

Detection Method Abstraction

This is the area that needs the most work.

There are currently 3 detection mechanisms.

  • HTTP Status
    As an optimization, only the headers are fetched for this mechanism. The text of the page is not needed: only the status code.
    NOTE: Why is there is a special check for GitHub? The headers optimization is not enabled for it.

  • Response URL
    This method disallows redirects. This is so the original status code can be captured.

  • Error Message
    The bête noire of Sherlock…​this does a blind search of the text in the response, and detects a failure based on there being a match or not.

Each of these are simply requests. But, we know that Sherlock is going to have to start understanding Javascript if it is going to be able to determine username availability on some sites. So, the detection method abstraction is going to have to be much more open. At the same time, any new detection methods need to be able to run in parallel. And, there needs to be enough infrastructure so that the options that future methods might use can be passed from the data.json to the method.

I am thinking that there needs to be a base class SherlockDetect(). The only job of this class is to do the query. Perhaps this query is to do a basic request and see the return status. Perhaps, it will use Selenium to simulate an entire browser session. But, when one runs a query via this class, the only thing you are going to get back is an enumeration something like the following:

from enum import Enum

class QueryResult(Enum):
    """Query Result Enumeration.

    Describes result of query about a given username.
    """
    CLAIMED   = "Claimed"   #Username Detected
    AVAILABLE = "Available" #Username Not Detected
    UNKNOWN   = "Unknown"   #Error Occurred While Trying To Detect Username
    ILLEGAL   = "Illegal"   #Username Not Allowable For This Site

    def __str__(self):
        """Convert Object To String.

        Keyword Arguments:
        self                   -- This object.

        Return Value:
        Nicely formatted string to get information about this object.
        """
        return self.value
So, we can inherit from the base SherlockDetect() other detection methods. SherlockDetectRequestStatus(), SherlockDetectResponseUrl(), or SherlockDetectErrorMessage()…​ The individual classes will be able to contain the logic of doing the request in their individual fashion. And, interpreting the results of this request.

Detection Engine

This is the area that needs almost as much work as the detection method.

While we need to be able to work with many detection methods, we don’t want to re-implementing everything each time a new detection method is added. Any aspect of detection that is going to be common for all detection methods should be built into a the Detection Engine. We should be DRY.

What are some of the things to do in the Detection Engine?

  • Gather together the list of sites that should be queried.
    We might want to check a username against all of the sites we know about. We might want to check it against only one.

  • Apply valid user name check (if available)
    For sites where we have established a regular expression that describes an acceptable format for usernames, we can do this check before the query. This will avoid the costly overhead of doing a query when it is not needed, and will also unburden the detection methods from having to do this.

  • Kick off query in parallel.
    The parallel processing is what gives Sherlock its speed.

  • Accumulate results of query
    Monitor the parallel process, and gather together the results.

Accumulating the results is going to be tricky. The Detection Engine should not be doing print statements, but the caller of it might. If we block until all of the results have been established, it will not allow the results to stream to the user as they complete: the entire time that the query takes will not be any faster, but if you make a human wait for results, they will be frustrated. Once when the caller of the Detection Engine starts the query, they need to poll some completion status? Grab results as they come in? Should their be an alternate interface available that is blocking?