Comparing URLs for similarity



URLs, links, and webpages are archived in the WARC format. I had experimented with creating WARC files while working on SoFee. Back then I tried the library from the Internet Archive, but it was unmaintained, not compatible with Python 3, and didn't work.

Then I came across another Python archiving library, warcio, which has a simple API for creating and reading archive files. It got me excited to resume working on SoFee 2.0.

warcio monkey-patches requests and captures all GET requests to create a single WARC file. This WARC file can be stored and accessed anytime, and ideally should render just like the original webpage, even if the original is moved, deleted, or no longer exists. To keep the archive as close to the original as possible, we need to fetch all the static content (images, JavaScript, CSS, icons) embedded in or used by a webpage (HTML). I use an HTML parsing library to find links to such resources, then fetch each of them with the requests library, and meanwhile warcio neatly tucks all of these resources into a single WARC file. It just works.
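
The post doesn't name the HTML parsing library it uses; as a minimal sketch with only the standard library's html.parser, collecting links to static resources might look like this:

```python
from html.parser import HTMLParser

class ResourceExtractor(HTMLParser):
    '''Collects URLs of static resources (stylesheets, scripts, images).'''
    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # <img src=...> and <script src=...> point at images and JavaScript
        if tag in ('img', 'script') and attrs.get('src'):
            self.resources.append(attrs['src'])
        # <link href=...> covers stylesheets and icons
        elif tag == 'link' and attrs.get('href'):
            self.resources.append(attrs['href'])

parser = ResourceExtractor()
parser.feed('<html><head><link rel="stylesheet" href="style.css">'
            '<script src="app.js"></script></head>'
            '<body><img src="logo.png"></body></html>')
print(parser.resources)  # → ['style.css', 'app.js', 'logo.png']
```

Each collected URL can then be fetched with requests while warcio records the responses.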

One step in optimizing this fetching process is to avoid redundant fetches of URLs that point to the same resource but don't look identical. These URLs are in principle the same, but because of small variations their string representations differ. For example:

  1. https://www.mygov.in/covid-19/ and https://www.mygov.in/covid-19 differ only in the trailing /, but both are the same.
  2. https://mygov.in/covid-19 and https://www.mygov.in/covid-19 differ only in the www. subdomain, but are the same.
  3. http://mygov.in/covid-19 and https://www.mygov.in/covid-19 differ in their protocol, http vs https, but are the same.
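
Python's urllib.parse.urlparse splits a URL into its components (scheme, netloc, path, and so on), which makes these comparisons straightforward:

```python
from urllib.parse import urlparse

parts = urlparse('https://www.mygov.in/covid-19/')
print(parts.scheme)  # → https
print(parts.netloc)  # → www.mygov.in
print(parts.path)    # → /covid-19/
```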

So I put together a small function that compares two URLs and reports whether they are the same or not:

from urllib.parse import urlparse

def check_url_similarity(url_1, url_2):
    '''Compare two URLs to identify if they point to the same resource.
    Returns bool: True/False based on the comparison.'''
    def check_path(path_1, path_2):
        # Handles paths that are identical or differ only by a trailing /
        if path_1 == path_2:
            return True
        return path_1 == path_2 + '/' or path_1 + '/' == path_2

    if url_1 == url_2:
        return True
    url_1_struct = urlparse(url_1)
    url_2_struct = urlparse(url_2)
    # Same host: comparing netloc and path (and ignoring the scheme)
    # also covers the http vs https case
    if url_1_struct.netloc == url_2_struct.netloc:
        if check_path(url_1_struct.path, url_2_struct.path):
            return True
    # Hosts differ only by a leading www. on either side
    if url_1_struct.netloc == 'www.' + url_2_struct.netloc or \
       'www.' + url_1_struct.netloc == url_2_struct.netloc:
        if check_path(url_1_struct.path, url_2_struct.path):
            return True
    return False

And I wrote these tests to make sure that the function does what I expect it to do:

import unittest

class TestUrlSimilarity(unittest.TestCase):
    def test_trailing_slash(self):
        url_1 = "https://www.mygov.in/covid-19/"
        url_2 = "https://www.mygov.in/covid-19"
        self.assertTrue(check_url_similarity(url_1, url_2))

    def test_missing_www_subdomain(self):
        url_1 = "https://mygov.in/covid-19"
        url_2 = "https://www.mygov.in/covid-19"
        self.assertTrue(check_url_similarity(url_1, url_2))

    def test_missing_www_subdomain_and_trailing_slash(self):
        url_1 = "https://mygov.in/covid-19/"
        url_2 = "https://www.mygov.in/covid-19"
        self.assertTrue(check_url_similarity(url_1, url_2))

        url_1 = "https://mygov.in/covid-19"
        url_2 = "https://www.mygov.in/covid-19/"
        self.assertTrue(check_url_similarity(url_1, url_2))

    def test_http_difference(self):
        url_1 = "https://mygov.in/covid-19"
        url_2 = "http://www.mygov.in/covid-19"
        self.assertTrue(check_url_similarity(url_1, url_2))

    def test_different_url(self):
        url_1 = "https://mygov.in/covid-19"
        url_2 = "https://www.india.gov.in/"
        self.assertFalse(check_url_similarity(url_1, url_2))
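
An alternative approach (a sketch, not what the function above does) is to normalize each URL once, dropping the scheme, a leading www., and any trailing /, and then compare the normalized strings:

```python
from urllib.parse import urlparse

def normalize(url):
    # Assumption: the scheme, a leading 'www.' and a trailing '/' are noise
    parts = urlparse(url)
    host = parts.netloc
    if host.startswith('www.'):
        host = host[4:]
    return host + parts.path.rstrip('/')

print(normalize('https://www.mygov.in/covid-19/'))  # → mygov.in/covid-19
print(normalize('http://mygov.in/covid-19'))        # → mygov.in/covid-19
```

Normalized forms can also serve as dictionary keys, which deduplicates a whole list of URLs in a single pass instead of comparing them pairwise.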