Below are some Python libraries for parsing and manipulating special text formats. Hopefully they will be useful to you.
01.Tablib
https://www.oschina.net/p/Tablib
Tablib is a Python library for working with tabular data. It can import, export and manipulate tabular datasets, and it offers advanced features such as slicing, dynamic columns, tags and filtering, and formatted import and export.
Tablib supports export/import formats including Excel, JSON, YAML, HTML, TSV and CSV, and does not support XML for the time being.
>>> import tablib
>>> data = tablib.Dataset(headers=['First Name', 'Last Name', 'Age'])
>>> for i in [('Kenneth', 'Reitz', 22), ('Bessie', 'Monke', 21)]:
...     data.append(i)
...
>>> print(data.export('json'))
[{"Last Name": "Reitz", "First Name": "Kenneth", "Age": 22}, {"Last Name": "Monke", "First Name": "Bessie", "Age": 21}]
>>> print(data.export('yaml'))
- {Age: 22, First Name: Kenneth, Last Name: Reitz}
- {Age: 21, First Name: Bessie, Last Name: Monke}
>>> data.export('xlsx')
<censored binary data>
>>> data.export('df')
  First Name Last Name  Age
0    Kenneth     Reitz   22
1     Bessie     Monke   21
02.Openpyxl
https://www.oschina.net/p/openpyxl
Openpyxl is a Python library for reading and writing Excel 2010 xlsx/xlsm/xltx/xltm files.
Openpyxl was created because there was no existing library for reading and writing the Office Open XML format natively from Python. It was initially based on PHPExcel.
from openpyxl import Workbook
wb = Workbook()

# grab the active worksheet
ws = wb.active

# Data can be assigned directly to cells
ws['A1'] = 42

# Rows can also be appended
ws.append([1, 2, 3])

# Python types will automatically be converted
import datetime
ws['A2'] = datetime.datetime.now()

# Save the file
wb.save("sample.xlsx")
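The example above only writes a file. As a complement, here is a small sketch of reading a workbook back, assuming openpyxl 2.6 or later (for the `values_only` argument). The temporary path and the sample rows are made up for the illustration.

```python
import os
import tempfile

from openpyxl import Workbook, load_workbook

# Write a small workbook to a temporary location (hypothetical sample data).
path = os.path.join(tempfile.mkdtemp(), "sample.xlsx")
wb = Workbook()
ws = wb.active
ws.append(["name", "qty"])
ws.append(["apple", 3])
ws.append(["pear", 5])
wb.save(path)

# Load it again and collect each row as a tuple of plain cell values.
wb2 = load_workbook(path)
ws2 = wb2.active
rows = list(ws2.iter_rows(values_only=True))

header, records = rows[0], rows[1:]
total_qty = sum(qty for _, qty in records)
print(header, records, total_qty)
```

With `values_only=True` the iteration yields the stored values instead of cell objects, which is usually enough when only the data is needed.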
03.unoconv
https://www.oschina.net/p/unoconv
unoconv (Universal Office Converter) is a command-line tool that converts between any of the file formats supported by LibreOffice/OpenOffice.
unoconv supports batch conversion of documents. Combined with asciidoc and docbook2odf/xhtml2odt, it can also produce PDF or Word (.doc) files.
[dag@moria cv]$ make odt pdf html doc
rm -f *.{odt,pdf,html,doc}
asciidoc -b docbook -d article -o resume.xml resume.txt
docbook2odf -f --params generate.meta=0 -o resume.tmp.odt resume.xml
Saved resume.tmp.odt
unoconv -f odt -t template.ott -o resume.odt resume.tmp.odt
unoconv -f pdf -t template.ott -o resume.pdf resume.odt
unoconv -f html -t template.ott -o resume.html resume.odt
unoconv -f doc -t template.ott -o resume.doc resume.odt
04.PyPDF2
https://www.oschina.net/p/pypdf
PyPDF2 is a pure-Python PDF library that can split, merge, crop and transform the pages of PDF files. It can also add custom data, viewing options and passwords to PDF files.
PyPDF2 can also extract text and metadata from PDFs and merge entire files together.
# Note: the camelCase names below belong to the legacy PyPDF2 API (PyPDF2 < 3.0).
from PyPDF2 import PdfFileWriter, PdfFileReader

output = PdfFileWriter()
input1 = PdfFileReader(open("document1.pdf", "rb"))

# print how many pages input1 has:
print("document1.pdf has %d pages." % input1.getNumPages())

# add page 1 from input1 to output document, unchanged
output.addPage(input1.getPage(0))

# add page 2 from input1, but rotated clockwise 90 degrees
output.addPage(input1.getPage(1).rotateClockwise(90))

# add page 3 from input1, rotated the other way:
output.addPage(input1.getPage(2).rotateCounterClockwise(90))
# alt: output.addPage(input1.getPage(2).rotateClockwise(270))

# add page 4 from input1, but first add a watermark from another PDF:
page4 = input1.getPage(3)
watermark = PdfFileReader(open("watermark.pdf", "rb"))
page4.mergePage(watermark.getPage(0))
output.addPage(page4)

# add page 5 from input1, but crop it to half size:
page5 = input1.getPage(4)
page5.mediaBox.upperRight = (
    page5.mediaBox.getUpperRight_x() / 2,
    page5.mediaBox.getUpperRight_y() / 2
)
output.addPage(page5)

# add some Javascript to launch the print window on opening this PDF.
# the password dialog may prevent the print dialog from being shown;
# comment out the encryption lines, if that's the case, to try this out
output.addJS("this.print({bUI:true,bSilent:false,bShrinkToFit:true});")

# encrypt your new PDF and add a password
password = "secret"
output.encrypt(password)

# finally, write "output" to PyPDF2-output.pdf
with open("PyPDF2-output.pdf", "wb") as outputStream:
    output.write(outputStream)
05.Mistune
http://mistune.readthedocs.io/
Mistune is a pure-Python Markdown parser with a full feature set, including tables, footnotes, code blocks and more.
Mistune is said to be the fastest of all pure Python markdown parsers (benchmark results). It is designed with modularity in mind to provide a clear and easy-to-use extensible API.
import mistune

mistune.markdown('I am using **mistune markdown parser**')
# output: <p>I am using <strong>mistune markdown parser</strong></p>
06.csvkit
https://www.oschina.net/p/csvkit
csvkit is often called the Swiss Army knife of CSV processing. It bundles csvlook, csvcut, csvsql and other practical tools, which can display CSV files as tables, easily select specific columns, and run SQL queries against the data.
csvkit is a command-line tool inspired by pdftk, gdal and other similar tools.
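csvkit's tools are run from the shell, so the following is not csvkit code. It is a standard-library-only Python sketch of what selecting columns (in the spirit of csvcut) and querying with SQL (in the spirit of csvsql) amount to. The sample data and the function names `cut_columns` and `query_csv` are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for a CSV file on disk.
SAMPLE = "name,city,age\nKenneth,Boston,22\nBessie,Austin,21\n"


def cut_columns(text, columns):
    """Keep only the named columns, similar in spirit to csvcut."""
    reader = csv.DictReader(io.StringIO(text))
    return [[row[c] for c in columns] for row in reader]


def query_csv(text, sql, table="data"):
    """Load the CSV into an in-memory SQLite table and run a query on it,
    similar in spirit to csvsql."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    conn = sqlite3.connect(":memory:")
    cols = ", ".join('"%s"' % h for h in header)
    conn.execute('CREATE TABLE "%s" (%s)' % (table, cols))
    marks = ", ".join("?" for _ in header)
    conn.executemany('INSERT INTO "%s" VALUES (%s)' % (table, marks), reader)
    result = conn.execute(sql).fetchall()
    conn.close()
    return result


names_and_ages = cut_columns(SAMPLE, ["name", "age"])
older = query_csv(SAMPLE, "SELECT name FROM data WHERE CAST(age AS INTEGER) > 21")
print(names_and_ages)
print(older)
```

Values read from CSV arrive as strings, which is why the query casts `age` to an integer before comparing it.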