Splitting paragraphs into sentences with Python
The Human Comedy
cd Calibre\ Library/Honore\ de\ Balzac/Collected\ Works\ of\ Honore\ de\ Balzac\ with\ the\ Complete\ Human\ Comedy\ \(Delphi\ Classics\)\ \(122\)/Collected\ Works\ of\ Honore\ de\ Balzac\ with\ t\ -\ Honore\ de\ Balzac
adb push text_part0025.html /sdcard/WebScrapBook/data/bookforweb/Collected\ Works\ of\ Honore\ de\ Balzac\ with\ t\ -\ Honore\ de\ Balzac/
In Search of Lost Time
cd Calibre\ Library/Marcel\ Proust/Swann\'s\ Way\ \(9\)/Swann\'s\ Way\ -\ Marcel\ Proust/OEBPS/
adb push \@public\@vhost\@g\@gutenberg\@html\@files\@7178\@7178-h\@7178-h-0.htm.html /sdcard/WebScrapBook/data/bookforweb/Swann\'s\ Way\ -\ Marcel\ Proust/OEBPS/
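The backslash escaping in the commands above can be avoided by building the command as an argument list in Python. This is only a sketch: adb_push_command is a hypothetical helper name, and the paths are the ones from the command above written without escapes.

```python
import shlex
from typing import List


def adb_push_command(local_file: str, remote_dir: str) -> List[str]:
    """Build the argument list for `adb push` (hypothetical helper)."""
    return ["adb", "push", local_file, remote_dir]


cmd = adb_push_command(
    "@public@vhost@g@gutenberg@html@files@7178@7178-h@7178-h-0.htm.html",
    "/sdcard/WebScrapBook/data/bookforweb/Swann's Way - Marcel Proust/OEBPS/",
)

# shlex.join renders the list as one shell-quoted command line for inspection.
print(shlex.join(cmd))

# To actually run it (requires adb on PATH and a connected device):
# import subprocess; subprocess.run(cmd, check=True)
```

Passing a list to subprocess.run bypasses the local shell, so spaces, parentheses and apostrophes in the path need no manual escaping.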
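After splitting, use Emacs to replace a period followed by a newline (typed as . C-q C-j) with .</p>, a newline, and <p> (typed as .</p> C-q C-j <p>), so that each sentence ends up in its own paragraph.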
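The same replacement can also be done in Python instead of Emacs. A minimal sketch, assuming the split text is already in a string; mark_paragraph_breaks is a hypothetical name, not part of the original script.

```python
def mark_paragraph_breaks(text: str) -> str:
    """Replace every '.' + newline with '.</p>' + newline + '<p>'.

    This mirrors the Emacs replacement literally, so the opening <p> of the
    first sentence and the closing </p> after the last one are not added.
    """
    return text.replace(".\n", ".</p>\n<p>")


sample = "First sentence.\nSecond sentence.\n"
print(mark_paragraph_breaks(sample))
```

Like the Emacs version, this only touches lines that end in a period; sentences ending in ? or ! are left unchanged.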
# Split the text of an HTML file into one sentence per line with NLTK's Punkt tokenizer.
import nltk
filename = "a.html"
# Read the whole file; the with block closes it automatically.
with open(filename, "r", encoding="utf-8") as file:
    text = file.read()
text = text.replace("\n", " ")
tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")
sentences = tokenizer.tokenize(text)
# print(sentences)
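# Write one sentence per line; the with block closes the file, so no explicit close() is needed.
with open('a1.html', 'w', encoding='utf-8') as f:
    for line in sentences:
        f.write(line)
        f.write('\n')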
def print_sentences():
    # Print the tokenized sentences for inspection.
    print(sentences)


if __name__ == '__main__':
    print_sentences()
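The script above can be wrapped into a reusable function. The sketch below is an illustration, not the original method: split_file and naive_split are hypothetical names, and naive_split is a simple regex stand-in for Punkt so that the example runs without NLTK or its model data. It will mis-split abbreviations such as "Mr." that Punkt handles.

```python
import re
from pathlib import Path
from typing import Callable, List


def naive_split(text: str) -> List[str]:
    """Regex stand-in for Punkt: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_file(
    src: str,
    dst: str,
    tokenize: Callable[[str], List[str]] = naive_split,
) -> List[str]:
    """Read src, split it into sentences, and write one sentence per line to dst."""
    # Join hard-wrapped lines first, as the original script does.
    text = Path(src).read_text(encoding="utf-8").replace("\n", " ")
    sentences = tokenize(text)
    Path(dst).write_text("".join(s + "\n" for s in sentences), encoding="utf-8")
    return sentences


print(naive_split("Hello world. How are you? Fine!"))
```

To use Punkt instead, pass the tokenize method of the loaded tokenizer, for example split_file("a.html", "a1.html", tokenize=nltk.data.load("tokenizers/punkt/english.pickle").tokenize).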