
Splitting paragraphs into sentences with Python

The Human Comedy (Balzac):

cd Calibre\ Library/Honore\ de\ Balzac/Collected\ Works\ of\ Honore\ de\ Balzac\ with\ the\ Complete\ Human\ Comedy\ \(Delphi\ Classics\)\ \(122\)/Collected\ Works\ of\ Honore\ de\ Balzac\ with\ t\ -\ Honore\ de\ Balzac
adb push text_part0025.html /sdcard/WebScrapBook/data/bookforweb/Collected\ Works\ of\ Honore\ de\ Balzac\ with\ t\ -\ Honore\ de\ Balzac/

In Search of Lost Time (Proust):

cd Calibre\ Library/Marcel\ Proust/Swann\'s\ Way\ \(9\)/Swann\'s\ Way\ -\ Marcel\ Proust/OEBPS/
adb push \@public\@vhost\@g\@gutenberg\@html\@files\@7178\@7178-h\@7178-h-0.htm.html /sdcard/WebScrapBook/data/bookforweb/Swann\'s\ Way\ -\ Marcel\ Proust/OEBPS/

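After splitting, use Emacs to replace every "." followed by a newline (the newline is entered with C-q C-j) with ".</p>", a newline (C-q C-j again), and "<p>", so that each sentence becomes its own paragraph.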

# Split the text of an HTML file into sentences, one per line, using NLTK.
import nltk
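filename = "a.html"
# Read the whole file; the with-block closes it automatically.
with open(filename, "r", encoding="utf-8") as file:
    text = file.read()
# Join hard-wrapped lines so the tokenizer sees continuous text.
text = text.replace("\n", " ")
# Load the pre-trained English Punkt model (needs a one-time nltk.download("punkt")).
tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")
sentences = tokenizer.tokenize(text)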

# Write one sentence per line; the with-block closes the file, so no f.close() is needed.
with open('a1.html', 'w', encoding="utf-8") as f:
    for line in sentences:
        f.write(line)
        f.write('\n')

# Print the sentence list as a quick check when the file is run as a script.
if __name__ == '__main__':
    print(sentences)
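
The Emacs replacement described earlier could also be done in Python instead of by hand. The following is only a minimal sketch of that idea, not part of the original workflow: the function name split_paragraph_tags and the sample string are made up for illustration, and it assumes the text is already one sentence per line, as in a1.html.

```python
def split_paragraph_tags(text: str) -> str:
    """Apply the same substitution as the Emacs step.

    Every "." followed by a newline becomes ".</p>", a newline, and "<p>",
    so each sentence line ends up in its own paragraph element.
    """
    return text.replace(".\n", ".</p>\n<p>")


# Hypothetical sample: one <p> element already split into sentence lines.
sample = "<p>First sentence.\nSecond sentence.</p>\n"
result = split_paragraph_tags(sample)
print(result)
```

Like the Emacs replacement, this only touches lines that end in a period; sentences ending in "?", "!", or a closing quote stay in the same paragraph as the line that follows them.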