Tokenizer offset mapping

Jerry Su Nov 23, 2021 2 mins
offsets = [(0, 0), (0, 1), (1, 2), (2, 3), (3, 7), (7, 8), (8, 9), (9, 10), (10, 11), (11, 12), (12, 13), (13, 14), (14, 15), (15, 16), (16, 17), (17, 18), (18, 19), (19, 20), (20, 21), (21, 25), (25, 26), (26, 27), (27, 28), (28, 29), (29, 30), (30, 31), (31, 32), (32, 33), (33, 34), (34, 35), (35, 36), (36, 37), (37, 38), (38, 39), (39, 40), (40, 41), (41, 42), (42, 43), (43, 44), (44, 45), (45, 46), (46, 47), (0, 0)]
tokens = ['[CLS]', '对', '儿', '童', 'sars', '##t', '细', '胞', '亚', '群', '的', '研', '究', '表', '明', ',', '与', '成', '人', 'sars', '相', '比', ',', '儿', '童', '细', '胞', '下', '降', '不', '明', '显', ',', '证', '明', '上', '述', '推', '测', '成', '立', '。', '[SEP]']
len(offsets) == len(tokens)
True
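"""
offset: (start, end), character positions in the original text; end is exclusive

start = index of the token's first character in the original text

end = start + len(token)   # the "##" prefix is not counted, i.e. len('##') is treated as 0

Special tokens ([CLS], [SEP]) do not appear in the text and get (0, 0).
"""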
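
The rule above can be checked with a short sketch. It uses only a prefix of the example, and `text` is an assumption reconstructed by joining the tokens (the original sentence is not shown in this post). The helper `surface` is a hypothetical name introduced here for illustration.

```python
# Prefix of the example data; `text` is reconstructed from the tokens (assumption).
text = "对儿童sarst细胞"
tokens = ['[CLS]', '对', '儿', '童', 'sars', '##t', '细', '胞', '[SEP]']
offsets = [(0, 0), (0, 1), (1, 2), (2, 3), (3, 7), (7, 8), (8, 9), (9, 10), (0, 0)]


def surface(token):
    """Drop the WordPiece '##' prefix, which occupies no characters in the text."""
    return token[2:] if token.startswith("##") else token


recovered = []
for (start, end), token in zip(offsets, tokens):
    if start == end:
        # Special tokens such as [CLS] and [SEP] map to the empty span (0, 0).
        continue
    piece = text[start:end]                     # end is exclusive
    assert piece == surface(token)              # the slice gives back the token
    assert end - start == len(surface(token))   # end = start + len(token), '##' not counted
    recovered.append(piece)

print("".join(recovered) == text)
```

With a tokenizer that lowercases or otherwise normalizes its input, the slice may differ from the token in case or form, so the equality check is likely to hold only for text that is already normalized.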

for idx, (offset, token) in enumerate(zip(offsets, tokens)):
    # idx - 1: token index with [CLS] excluded; offset[-1] - 1: inclusive index of the last character
    print(idx - 1, offset, token, offset[0], offset[-1] - 1)
-1 (0, 0) [CLS] 0 -1
0 (0, 1) 对 0 0
1 (1, 2) 儿 1 1
2 (2, 3) 童 2 2
3 (3, 7) sars 3 6
4 (7, 8) ##t 7 7
5 (8, 9) 细 8 8
6 (9, 10) 胞 9 9
7 (10, 11) 亚 10 10
8 (11, 12) 群 11 11
9 (12, 13) 的 12 12
10 (13, 14) 研 13 13
11 (14, 15) 究 14 14
12 (15, 16) 表 15 15
13 (16, 17) 明 16 16
14 (17, 18) , 17 17
15 (18, 19) 与 18 18
16 (19, 20) 成 19 19
17 (20, 21) 人 20 20
18 (21, 25) sars 21 24
19 (25, 26) 相 25 25
20 (26, 27) 比 26 26
21 (27, 28) , 27 27
22 (28, 29) 儿 28 28
23 (29, 30) 童 29 29
24 (30, 31) 细 30 30
25 (31, 32) 胞 31 31
26 (32, 33) 下 32 32
27 (33, 34) 降 33 33
28 (34, 35) 不 34 34
29 (35, 36) 明 35 35
30 (36, 37) 显 36 36
31 (37, 38) , 37 37
32 (38, 39) 证 38 38
33 (39, 40) 明 39 39
34 (40, 41) 上 40 40
35 (41, 42) 述 41 41
36 (42, 43) 推 42 42
37 (43, 44) 测 43 43
38 (44, 45) 成 44 44
39 (45, 46) 立 45 45
40 (46, 47) 。 46 46
41 (0, 0) [SEP] 0 -1
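
One possible use of these spans is to look up which token covers a given character. The following is a minimal sketch on a prefix of the offsets; `char_to_token` is a hypothetical helper, not part of any tokenizer library. The indices it returns count [CLS] as token 0, so they are one higher than the first column printed above.

```python
# Prefix of the offsets list from the example above.
offsets = [(0, 0), (0, 1), (1, 2), (2, 3), (3, 7), (7, 8), (8, 9), (9, 10), (0, 0)]


def char_to_token(offsets):
    """Map each character index to the index of the token whose span covers it."""
    lookup = {}
    for tok_idx, (start, end) in enumerate(offsets):
        # range(start, end) is empty for special tokens with span (0, 0),
        # so they never claim a character.
        for char_idx in range(start, end):
            lookup[char_idx] = tok_idx
    return lookup


lookup = char_to_token(offsets)
# Characters 3..6 belong to 'sars' (token 4); character 7 belongs to '##t' (token 5).
print(lookup[3], lookup[6], lookup[7])
```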
