Python3 - handling hyphenated words: combine and split

Asked by: 小点点

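I want to handle hyphenated words. For example, I would like to process the word "well-known" in two different ways.

The first is to combine the word, i.e. ("wellknown"); the second is to split it, i.e. ("well", "known").

The input would be "well-known", and the expected output is: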

--wellknown

--well

--known

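However, I can only produce one form or the other, not both at the same time. When I loop through my text file and find a hyphenated word, I combine it first.

After combining it, I don't know how to get back to the original word and perform the split. Below is a short snippet from my code. (Let me know if you need to see more details.)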

import re
import string

for text in contents.split():
    if not re.search(r'\d', text):                 # remove numbers
        if text not in string.punctuation:         # remove punctuation
            if '-' in text:
                combine = text.replace("-", '')      # ??problem parts (wellknown)
                separate = re.split(r'[^a-z]', text) # ??problem parts (well, known)

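I think the reason I can't do both operations is that once I have replaced the hyphen, the hyphenated word is gone, so I can no longer find it to perform the split (`separate` in the code). Does anyone know how to do this, or how to fix the logic?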

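2 answers

Anonymous user

Why not just use a tuple that contains both the separated words and the combined word?

Split first, then combine:

Sample code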

text = 'well-known'
separate = text.split('-')                    # ['well', 'known']
combined = ''.join(separate)                  # 'wellknown'
words = (combined, separate[0], separate[1])  # assumes exactly one hyphen
print(words)

Output

('wellknown', 'well', 'known')
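
As a side note on the question's logic: Python strings are immutable, so `replace` and `split` each return a new object and leave `text` untouched, which means both operations can be run on the same variable. A minimal sketch that also generalizes the tuple idea to any number of hyphens follows; the helper name `combine_and_split` is made up for illustration and is not part of the answer above.

```python
def combine_and_split(word):
    """Return (combined, part1, part2, ...) for a hyphenated word."""
    # Drop empty pieces produced by leading, trailing, or doubled hyphens.
    parts = [p for p in word.split('-') if p]
    return (''.join(parts), *parts)

text = 'well-known'
print(combine_and_split(text))                # ('wellknown', 'well', 'known')
print(combine_and_split('state-of-the-art'))  # ('stateoftheart', 'state', 'of', 'the', 'art')
print(text)                                   # well-known (the original string is unchanged)
```

Unpacking with `*parts` avoids the hard-coded `separate[0]` and `separate[1]`, which would raise an `IndexError` on a word with no hyphen and silently drop parts on a word with two or more.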

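Anonymous user

Treat tokens as objects rather than strings; then you can create tokens with multiple attributes.

For example, we can use the `collections.namedtuple` container as a simple object to hold a token: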

from collections import namedtuple

from nltk import word_tokenize

Token = namedtuple('Token', ['surface', 'splitup', 'combined'])

text = "This is a well-known example of a small-business grant of $123,456."

tokenized_text = []

for token in word_tokenize(text):
    if '-' in token:
        this_token = Token(token, tuple(token.split('-')),  token.replace('-', ''))
    else:
        this_token = Token(token, token, token)
    tokenized_text.append(this_token)

for token in tokenized_text:
    print(token.surface)

[out]:

This
is
a
well-known
example
of
a
small-business
grant
of
$
123,456
.

If you need to access the combined tokens:

for token in tokenized_text:
    print(token.combined)

[out]:

This
is
a
wellknown
example
of
a
smallbusiness
grant
of
$
123,456
.

If you want to access the split-up tokens, use the same loop, but note that for hyphenated words you get a tuple instead of a string, e.g.

for token in tokenized_text:
    print(token.splitup)

[out]:

This
is
a
('well', 'known')
example
of
a
('small', 'business')
grant
of
$
123,456
.

You can also use a list comprehension to access the attributes of the Token namedtuples, e.g.

>>> [token.splitup for token in tokenized_text]
['This', 'is', 'a', ('well', 'known'), 'example', 'of', 'a', ('small', 'business'), 'grant', 'of', '$', '123,456', '.']
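
If what you ultimately want is one flat list containing the combined form followed by its parts (as in the question's expected output), the tuples can be flattened in a second pass. This is only a sketch: `make_token` and `flat_terms` are illustrative names, and plain whitespace splitting stands in for `word_tokenize` so the example runs without NLTK.

```python
from collections import namedtuple

Token = namedtuple('Token', ['surface', 'splitup', 'combined'])

def make_token(word):
    # Same construction as in the answer's loop.
    if '-' in word:
        return Token(word, tuple(word.split('-')), word.replace('-', ''))
    return Token(word, word, word)

# Whitespace tokenization is an assumption made to keep the sketch self-contained.
tokens = [make_token(w) for w in "a well-known small-business grant".split()]

flat_terms = []
for token in tokens:
    if isinstance(token.splitup, tuple):
        # Hyphenated: emit the combined form, then each part.
        flat_terms.append(token.combined)
        flat_terms.extend(token.splitup)
    else:
        flat_terms.append(token.surface)

print(flat_terms)
# ['a', 'wellknown', 'well', 'known', 'smallbusiness', 'small', 'business', 'grant']
```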

To identify the tokens that contained a hyphen and were split, you can simply check the type, e.g.

>>> [type(token.splitup) for token in tokenized_text]
[str, str, str, tuple, str, str, str, tuple, str, str, str, str, str]
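
Checking the type works, but mixing `str` and `tuple` in one field can be awkward for downstream code. One possible variation, not part of the answer above, is to always store a tuple in `splitup` and expose an explicit flag. The sketch below uses `typing.NamedTuple`; the `is_hyphenated` property and the `make_token` helper are hypothetical names chosen for illustration.

```python
from typing import NamedTuple, Tuple

class Token(NamedTuple):
    surface: str              # token as it appeared in the text
    splitup: Tuple[str, ...]  # always a tuple, even for unhyphenated tokens
    combined: str             # the parts joined back together without hyphens

    @property
    def is_hyphenated(self):
        return len(self.splitup) > 1

def make_token(word):
    # Fall back to the word itself if splitting leaves nothing (e.g. a bare "-").
    parts = tuple(p for p in word.split('-') if p) or (word,)
    return Token(word, parts, ''.join(parts))

tokens = [make_token(w) for w in "a well-known example".split()]
print([t.surface for t in tokens if t.is_hyphenated])  # ['well-known']
```

With this layout every `splitup` can be iterated the same way, and the hyphen check no longer depends on `type()`.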


		      