Comparing URLs for similarity
URLs, links, and webpages are archived in the WARC format. I had experimented with creating WARC files while working on an earlier project. Back then I tried using a library from the Internet Archive, but it was unmaintained, incompatible with Python 3, and didn't work.
Recently I came across another Python archiving library, warcio, which has a simple API for creating and reading archive files. It got me excited enough to resume working on the project.
warcio monkey-patches requests and captures all GET requests to create a single WARC file. This WARC file can be stored and accessed anytime, and should ideally render just like the original webpage, even if the original is removed, deleted, or no longer exists. To make the archive as close to the original as possible, we need to fetch all the static content (images, JavaScript, CSS, icons) embedded in or used by a webpage (HTML). I use an HTML parsing library to find links to such resources. Then I fetch these resources one by one with the requests library, and meanwhile warcio neatly tucks them all into a single WARC file. It just works.
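As a minimal sketch of that capture step, warcio's capture_http context manager wraps requests like this (the filename and URL here are just placeholders):

from warcio.capture_http import capture_http
import requests  # requests must be imported after capture_http for the patching to work

# every GET made inside this block is recorded into the WARC file
with capture_http('archive.warc.gz'):
    requests.get('https://www.mygov.in/covid-19/')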
One step in optimizing the process of fetching these resources is to avoid redundant fetches of URLs that are effectively the same but don't look identical. Such URLs point to the same resource, but because of a few caveats differ in their string representation. For example:
- https://www.mygov.in/covid-19/ and https://www.mygov.in/covid-19 differ by a trailing /, but both are the same.
- https://mygov.in/covid-19 and https://www.mygov.in/covid-19 differ in the www. subdomain, but are the same.
- http://mygov.in/covid-19 and https://www.mygov.in/covid-19 differ in their protocol, http and https, but are the same.
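Python's standard library already splits a URL into the pieces these comparisons need. A quick illustration with urlparse, using one of the URLs above:

from urllib.parse import urlparse

parts = urlparse('https://www.mygov.in/covid-19/')
print(parts.scheme)  # https
print(parts.netloc)  # www.mygov.in
print(parts.path)    # /covid-19/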
So I put together a small function that compares two URLs and decides whether they are the same or not:
from urllib.parse import urlparse

def check_url_similarity(url_1, url_2):
    '''Compare two URLs to identify whether they point to the same resource.
    Returns bool: True/False based on the comparison.'''

    def check_path(path_1, path_2):
        # handles paths that are identical or differ only by a trailing /
        if path_1 == path_2:
            return True
        return path_1 == path_2 + '/' or path_1 + '/' == path_2

    if url_1 == url_2:
        return True

    # fall back to a structural comparison; the scheme is deliberately
    # ignored, so http/https differences don't matter
    url_1_struct = urlparse(url_1)
    url_2_struct = urlparse(url_2)
    if url_1_struct.netloc == url_2_struct.netloc:
        if check_path(url_1_struct.path, url_2_struct.path):
            return True
    # treat hosts that differ only by a www. prefix as the same
    if url_1_struct.netloc == 'www.' + url_2_struct.netloc or \
            'www.' + url_1_struct.netloc == url_2_struct.netloc:
        if check_path(url_1_struct.path, url_2_struct.path):
            return True
    return False
And I wrote these tests to make sure the function does what I expect it to:
import unittest

class TestUrlSimilarity(unittest.TestCase):
    def test_trailing_slash(self):
        url_1 = "https://www.mygov.in/covid-19/"
        url_2 = "https://www.mygov.in/covid-19"
        self.assertTrue(check_url_similarity(url_1, url_2))

    def test_missing_www_subdomain(self):
        url_1 = "https://mygov.in/covid-19"
        url_2 = "https://www.mygov.in/covid-19"
        self.assertTrue(check_url_similarity(url_1, url_2))

    def test_missing_www_subdomain_and_trailing_slash(self):
        url_1 = "https://mygov.in/covid-19/"
        url_2 = "https://www.mygov.in/covid-19"
        self.assertTrue(check_url_similarity(url_1, url_2))

        url_1 = "https://mygov.in/covid-19"
        url_2 = "https://www.mygov.in/covid-19/"
        self.assertTrue(check_url_similarity(url_1, url_2))

    def test_http_difference(self):
        url_1 = "https://mygov.in/covid-19"
        url_2 = "http://www.mygov.in/covid-19"
        self.assertTrue(check_url_similarity(url_1, url_2))

    def test_different_url(self):
        url_1 = "https://mygov.in/covid-19"
        url_2 = "https://www.india.gov.in/"
        self.assertFalse(check_url_similarity(url_1, url_2))
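To run the tests directly from the same file as the function (assuming both live together), the standard unittest entry point works:

if __name__ == '__main__':
    unittest.main()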