6 Recommended Python Libraries for Processing Special Text Formats

Keywords: Python xml Excel JSON

Link to the original text: https://www.jianshu.com/u/8f2987e2f9fb

Below are some libraries written in Python for parsing and manipulating special text formats. Hopefully they will be useful to you.

01.Tablib

https://www.oschina.net/p/Tablib

Tablib is a Python library for working with tabular data. It can import, export and manipulate tabular datasets, and it offers advanced features such as slicing, dynamic columns, tags and filtering, and formatted import and export.

Tablib supports Excel, JSON, YAML, HTML, TSV and CSV as import/export formats. XML is not supported for the time being.

>>> import tablib
>>> data = tablib.Dataset(headers=['First Name', 'Last Name', 'Age'])
>>> for i in [('Kenneth', 'Reitz', 22), ('Bessie', 'Monke', 21)]:
...     data.append(i)
...
>>> print(data.export('json'))
[{"Last Name": "Reitz", "First Name": "Kenneth", "Age": 22}, {"Last Name": "Monke", "First Name": "Bessie", "Age": 21}]
>>> print(data.export('yaml'))
- {Age: 22, First Name: Kenneth, Last Name: Reitz}
- {Age: 21, First Name: Bessie, Last Name: Monke}
>>> data.export('xlsx')
<censored binary data>
>>> data.export('df')
  First Name Last Name  Age
0    Kenneth     Reitz   22
1     Bessie     Monke   21

02.Openpyxl

https://www.oschina.net/p/openpyxl

Openpyxl is a Python library for reading and writing Excel 2010 xlsx/xlsm/xltx/xltm files.

Openpyxl was created so that Python could read and write the Office Open XML format natively. It was initially based on PHPExcel.

import datetime

from openpyxl import Workbook

wb = Workbook()

# grab the active worksheet
ws = wb.active

# Data can be assigned directly to cells
ws['A1'] = 42

# Rows can also be appended
ws.append([1, 2, 3])

# Python types will automatically be converted
ws['A2'] = datetime.datetime.now()

# Save the file
wb.save("sample.xlsx")
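
The example above only writes a file. Since Openpyxl also reads xlsx files, a round trip might look like the sketch below. The helper name round_trip and the sample rows are invented for illustration, and the file is written to a temporary directory so nothing is left behind.

```python
import os
import tempfile

from openpyxl import Workbook, load_workbook


def round_trip(rows):
    """Write rows to a temporary .xlsx file, then read them back as tuples."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.xlsx")

        # Write: one worksheet row per input row.
        wb = Workbook()
        ws = wb.active
        for row in rows:
            ws.append(row)
        wb.save(path)

        # Read: values_only=True yields plain cell values instead of Cell objects.
        loaded = load_workbook(path)
        values = list(loaded.active.iter_rows(values_only=True))
        loaded.close()
        return values


if __name__ == "__main__":
    for row in round_trip([["name", "qty"], ["apple", 3]]):
        print(row)
```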

03.unoconv

https://www.oschina.net/p/unoconv

unoconv, short for Universal Office Converter, is a command-line tool that converts between any of the file formats supported by LibreOffice/OpenOffice.

unoconv supports batch conversion of documents. Combined with asciidoc and docbook2odf/xhtml2odt, it can also be used to create PDF or Word (.doc) files.

[dag@moria cv]$ make odt pdf html doc
rm -f *.{odt,pdf,html,doc}
asciidoc -b docbook -d article -o resume.xml resume.txt
docbook2odf -f --params generate.meta=0 -o resume.tmp.odt resume.xml
Saved resume.tmp.odt
unoconv -f odt -t template.ott -o resume.odt resume.tmp.odt
unoconv -f pdf -t template.ott -o resume.pdf resume.odt
unoconv -f html -t template.ott -o resume.html resume.odt
unoconv -f doc -t template.ott -o resume.doc resume.odt
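
Because unoconv is a command-line tool, batch conversion from Python usually means building the same kind of command lines as in the make output above and handing them to subprocess. The sketch below assumes unoconv is on the PATH and uses only the -f, -t and -o options that appear above. The function names are hypothetical, and dry_run defaults to True so nothing is executed unless requested.

```python
import shlex
import subprocess
from pathlib import Path


def build_unoconv_commands(sources, fmt, template=None):
    """Return one unoconv argument list per source file."""
    commands = []
    for src in sources:
        src = Path(src)
        cmd = ["unoconv", "-f", fmt]
        if template:
            cmd += ["-t", template]
        # Output name: same stem as the source, new extension.
        cmd += ["-o", str(src.with_suffix("." + fmt)), str(src)]
        commands.append(cmd)
    return commands


def convert_all(sources, fmt, template=None, dry_run=True):
    """Print the commands, or run them when dry_run is False."""
    for cmd in build_unoconv_commands(sources, fmt, template):
        if dry_run:
            print(" ".join(shlex.quote(part) for part in cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    convert_all(["resume.odt", "letter.odt"], "pdf", template="template.ott")
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.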

04.PyPDF2

https://www.oschina.net/p/pypdf

PyPDF2 is a pure-Python PDF library that can split, merge, crop and transform the pages of PDF files. It can also add custom data, viewing options and passwords to PDF files.

PyPDF2 can also extract text and metadata from PDFs and merge entire files together.

# Note: these are the PyPDF2 1.x names; later releases renamed them
# (for example PdfReader and PdfWriter).
from PyPDF2 import PdfFileWriter, PdfFileReader

output = PdfFileWriter()
input1 = PdfFileReader(open("document1.pdf", "rb"))

# print how many pages input1 has:
print("document1.pdf has %d pages." % input1.getNumPages())

# add page 1 from input1 to output document, unchanged
output.addPage(input1.getPage(0))

# add page 2 from input1, but rotated clockwise 90 degrees
output.addPage(input1.getPage(1).rotateClockwise(90))

# add page 3 from input1, rotated the other way:
output.addPage(input1.getPage(2).rotateCounterClockwise(90))
# alt: output.addPage(input1.getPage(2).rotateClockwise(270))

# add page 4 from input1, but first add a watermark from another PDF:
page4 = input1.getPage(3)
watermark = PdfFileReader(open("watermark.pdf", "rb"))
page4.mergePage(watermark.getPage(0))
output.addPage(page4)

# add page 5 from input1, but crop it to half size:
page5 = input1.getPage(4)
page5.mediaBox.upperRight = (
    page5.mediaBox.getUpperRight_x() / 2,
    page5.mediaBox.getUpperRight_y() / 2
)
output.addPage(page5)

# add some Javascript to launch the print window on opening this PDF.
# the password dialog may prevent the print dialog from being shown;
# if that's the case, comment out the encryption lines to try this out
output.addJS("this.print({bUI:true,bSilent:false,bShrinkToFit:true});")

# encrypt your new PDF and add a password
password = "secret"
output.encrypt(password)

# finally, write "output" to PyPDF2-output.pdf
with open("PyPDF2-output.pdf", "wb") as outputStream:
    output.write(outputStream)

05.Mistune

http://mistune.readthedocs.io/

Mistune is a full-featured Markdown parser written in pure Python, with support for tables, footnotes, code blocks and more.

Mistune is said to be the fastest of all pure-Python Markdown parsers, according to its published benchmark results. It is designed with modularity in mind and provides a clear, easy-to-use, extensible API.

import mistune

mistune.markdown('I am using **mistune markdown parser**')
# output: <p>I am using <strong>mistune markdown parser</strong></p>

06.csvkit

https://www.oschina.net/p/csvkit

csvkit is known as the Swiss Army knife for processing CSV files. It bundles csvlook, csvcut, csvsql and other practical tools, which can display CSV files as tables, select specific columns from a CSV file, and run SQL queries against it.

csvkit is a command-line tool inspired by pdftk, gdal and other similar tools.
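
To give a feel for what column selection and SQL over CSV mean in practice, here is a rough standard-library analogue in Python. It is not how csvkit is implemented and does not use csvkit at all. The sample data and the function names cut_columns and query_csv are invented for the example, which only mimics the idea behind csvcut -c and csvsql --query.

```python
import csv
import io
import sqlite3

SAMPLE = "name,city,age\nAda,London,36\nLin,Taipei,29\n"


def cut_columns(csv_text, columns):
    """Keep only the named columns, in the order given."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return out.getvalue()


def query_csv(csv_text, sql, table="data"):
    """Load the CSV into an in-memory SQLite table and run one query."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    columns = ", ".join('"{}"'.format(h.replace('"', '""')) for h in header)
    marks = ", ".join("?" for _ in header)
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute('CREATE TABLE "{}" ({})'.format(table, columns))
        # Every value is stored as text, so numeric comparisons need a CAST.
        conn.executemany(
            'INSERT INTO "{}" VALUES ({})'.format(table, marks), reader)
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    print(cut_columns(SAMPLE, ["name", "age"]), end="")
    print(query_csv(SAMPLE,
                    "SELECT name FROM data WHERE CAST(age AS INTEGER) > 30"))
```

For real work the csvkit commands are more convenient, since they handle type inference, encodings and large files that this sketch ignores.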

Posted by xinnex on Thu, 12 Sep 2019 02:49:36 -0700