一万条url找出相似的url

算法课抽到的另一道题，大厂面试题收藏了1万条url，现在给你一条url，如何找出相似的url。（面试官不解释何为相似）方法一：若url属于同一服务，一个URL是另一个URL的前缀，或者两个URL的前面的目录相同，可利用正则表达式#!/usr/bin/env python# -*- coding:utf-8 -*-# 认为前面的目录相同则为相似import regiven_url = 'https:

飞飞帅傅

747人浏览 · 2021-06-09 11:49:32

飞飞帅傅 · 2021-06-09 11:49:32 发布

算法课抽到的另一道题，大厂面试题

收藏了1万条url，现在给你一条url，如何找出相似的url。（面试官不解释何为相似）。

方法一：若url属于同一服务，一个URL是另一个URL的前缀，或者两个URL的前面的目录相同，可利用正则表达式

#!/usr/bin/env python
# -*- coding:utf-8 -*-

# 认为前面的目录相同则为相似
import re

given_url = 'https://blog.csdn.net/weixin_51617086'
urls = ['https://blog.csdn.net/weixin_51617086/article/details/117405692?spm=1001.2014.3001.5501',
        'https://blog.csdn.net/weixin_51617086/article/details/117339353?spm=1001.2014.3001.5501',
        'https://blog.csdn.net/weixin_51617086/article/details/117299551?spm=1001.2014.3001.5501',
        'https://space.bilibili.com/146015656/favlist?fid=1239219656&ftype=create',
        'https://so.csdn.net/so/search',
        'https://blog.csd./251/nweixin_51617086']

for url in urls:
    rs = re.match(r'https://blog.csdn.net/weixin_51617086.*', url)
    if rs is not None:
        print("相似url:{}".format(rs.group()))

在这里插入图片描述

方法二：若指URL字串上大都相同，考虑相似是指匹配字符较多。那就用动态规划求字符对齐匹配数最大的。那么较多是多少呢？这个标准可以人为取，假设认为80%以上才叫相似。达标的就认为是相似的转化为：求最长公共子序列的题目。

最长公共子序列

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# author:Zfy  date:2021/5/31 22:50


# 二、URL字串上大都相同则为相似，拟定为80%以上为相似
given_url = 'https://blog.csdn.net/weixin_51617086'
urls = ['https://blog.csdn.net/weixin_51617086/article/details/117405692?spm=1001.2014.3001.5501',
        'https://blog.csdn.net/weixin_51617086/article/details/117339353?spm=1001.2014.3001.5501',
        'https://blog.csdn.net/weixin_51617086/article/details/117299551?spm=1001.2014.3001.5501',
        'https://space.bilibili.com/146015656/favlist?fid=1239219656&ftype=create',
        'https://so.csdn.net/so/search',
        'https://blog.csd./251/nweixin_51617086']

def lcs(str1, str2):
    len1 = len(str1)
    len2 = len(str2)
    arr = [[0 for _ in range(len2+1)] for _ in range(len1+1)]  # len1+1行，len2+1列
    b = [[0 for _ in range(len2+1)] for _ in range(len1+1)]  # 用于记录字符位置 1 左上方 2 上方 3 左方
    for i in range(1, len1+1):
        for j in range(1, len2+1):
            if str1[i-1] == str2[j-1]:  # i j位置上的字符匹配的时候，来自于左上方+1
                arr[i][j] = arr[i-1][j-1] + 1
                b[i][j] = 1
            elif arr[i-1][j] > arr[i][j-1]:  # 来自于上方
                arr[i][j] = arr[i-1][j]
                b[i][j] = 2
            else:
                arr[i][j] = arr[i][j-1]
                b[i][j] = 3
    return arr[-1][-1], b

def lcs_trackback(x, y):  # 回溯
    arr, b = lcs(x, y)
    i = len(x)
    j = len(y)
    res = []
    while i > 0 and j > 0:
        if b[i][j] == 1:  # 来自于左上方 匹配
            res.append(x[i-1])
            i -= 1
            j -= 1
        elif b[i][j] == 2:  # 来自于上方 不匹配
            i -= 1
        else:
            j -= 1  # 来自于左方 不匹配
    return ''.join(reversed(res))


for url in urls:
    com = lcs_trackback(given_url, url)
    if len(com) / len(given_url) >= 0.8:
        print("公共子序列：{}，url:{}".format(com, url))