Implementing a Trie in Python

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106


from typing import Tuple

# char: 单字符,形如'h','a','c','k'
# word: 词,形如"hack"


class TrieNode(object):
    """
    字符节点
    """
    def __init__(self, char: str):
        self.char = char  # 存储的字符
        self.children = []  # 该节点的子节点
        self.word_finished = False  # 是否是词尾
        self.counter = 1  # 出现在word中的次数


def add(root, word: str):
    """
    向字典树中添加词
    """
    node = root  # node的初始值为root节点*
    for char in word:  # 遍历整个word
        found_in_child = False  # 标记当前char是否在node的子结点列表中
        for child in node.children:
            if child.char == char:  # node节点的子节点中包含当前的char
                child.counter += 1  # 该字符在word中出现的次数加一
                node = child  # node向下移动一位,即为当前字符
                found_in_child = True  # 标记为找到
                break
        if not found_in_child:
            # 当前字符不在node的子节点列表中,则新建一个节点加入到node的子节点列表中
            new_node = TrieNode(char)
            node.children.append(new_node)
            node = new_node  # node移至当前节点
    # 当word中所有字符被遍历完后,node即为最后一个字符,将其标记为词尾
    node.word_finished = True


def find_profix(root, prefix: str) -> Tuple[bool, int]:
    """
    检查并返回:
    1. 该前缀是否出现在添加过的词中
    2. 如果是,那么有多少词组包含该前缀
    """
    node = root
    if not root.children:  # 如果当前节点无子节点则直接返回(False,0)
        return False, 0
    for char in prefix:  # 遍历prefix中的字符
        char_not_found = True  # 标记当前字符是否出现在node的子节点列表中,默认为True
        for child in node.children:
            if child.char == char:  # 如果当前字符包含在node的子结点中
                char_not_found = False
                node = child  # node向前移一位,即移至当前字符的位置
                break
        if char_not_found:  # 如果当前字符不在node的子结点中则直接返回(False,0)
            return False, 0
    # 执行至此步则表示prefix的所有字符都出现在字典树中,直接返回True和最后一个字符的counter
    return True, node.counter


def get_all_words(root: TrieNode):
    """
    获取Trie中所有的词组
    """
    all_words = []
    current_word = []
    # 递归方式
    def tmp(root: TrieNode):
        for node in root.children:  # 遍历当前节点的所有子节点
            current_word.append(node.char)  # 将当前子节点加入到current_word中
            if node.word_finished:  # 如果当前字符为词尾
                all_words.append(current_word.copy())  # 将当前词组加入到all_words中
            if node.children:
                tmp(node)  # 当前子节点有子节点, 继续递归遍历
            else:
                current_word.pop()  # 当前子节点无子节点即为词尾,出栈
        if current_word:
            current_word.pop()
        return

    tmp(root)  # 从根节点开始
    return [''.join(word) for word in all_words]


if __name__ == "__main__":
    # 构造根节点和字典树
    root = TrieNode('*')
    add(root, "hackathon")
    add(root, "hack")

    # 在字典树中查找
    print(find_profix(root, "hac"))
    print(find_profix(root, 'hack'))
    print(find_profix(root, 'hackathon'))
    print(find_profix(root, 'ha'))
    print(find_profix(root, 'hammer'))
    print(get_all_words(root))

# 输出结果为:
# (True, 2)
# (True, 2)
# (True, 1)
# (True, 2)
# (False, 0)
# ['hack', 'hackathon']

工作原理

算法的主要步骤

首先要考虑的是如何将word加入到Trie中, 这也是add方法的职责所在。它的工作方式非常简单。它需要两个参数:root node(即根节点，一般使用*)和word
然后它从单词的第一个字符开始遍历, 一次一个字符
检查当前node的子节点中是否含有该字符
如果有, 则只增加该字符的counter, 以表明该字符是重复出现
如果没有, 则只用简单的将该字符添加到当前node节点的子节点列表中
对于4&5这两种情况, 在考虑下一个字符之前, 都是使用node的child node作为当前节点(这意味着, 当前的child node在下一个循环中将成为node节点)

这就是在Trie中添加一个单词的所有步骤。还需要做的另一件事是在整个过程完成后标记单词的结尾。这意味着Trie的每个叶节点的word_finished属性为True。

要搜索前缀，只需执行几个简单的步骤

find_profix函数需要两个参数, root节点和需要搜索的profix
每次从peofix中按循序取一个字符与"当前node"的子节点比较, 找出包含该字符的节点
如果找到了, 则将该子节点作为当前node进行下一轮比较
如果未找到则返回(False, 0)表明在Trie中该前缀不存在
在试图找到一个比单词本身更大的前缀时, 该算法也将返回(False, 0)
如果Trie包含该前缀则返回(True, 该前缀出现在Trie包含的word中的次数)(即在Trie包含的所有词组中有多少个词包含该前缀)

摘自: Meidum: Implementing a Trie in Python

原作者: Shubhadeep Roychowdhury