Using the Toolbox
The objective of querido-diario-toolbox is to equip Querido Diário’s (QD) community with the tools necessary to conduct their own analyses and data manipulations using data obtained through QD. Additionally, the library will be integrated into the production applications used by Querido Diário, meaning that anyone using the library will be able to locally reproduce the same data processing steps performed by QD.
The library provides many levels of abstractions to work with the data. Ranging from simple text cleaning using strings to converting files from various formats into plain text.
Installing
pip install querido-diario-toolbox
Currently, querido-diario-toolbox is compatible with Python 3.8+.
To perform the text extractions it’s necessary to install Tesseract OCR, as
well as the .jar
files from Apache Tika (last tested version: 1.24.1) and
Tabula (last tested version: v1.0.4) accessible in order to pass their file
paths as arguments.
Use case
More elaborate examples are available in the examples folder. You can view them (and interact if you wish) using Jupyter notebooks.
Removing unnecessary spaces in text
from querido_diario_toolbox.process.text_process import remove_breaks
texto = "\n\n\nThis text has many white spaces\n\n \nunnecessary.\n"
remove_breaks(texto)
'This text has many white spaces unnecessary.'
Finding valid CNPJs in text
from querido_diario_toolbox.process.edition_process import extract_and_validate_cnpj
texto = "The companies with valid CNPJs 00.000.000/0001-91 and 00.360.305/0001-04 exist, but the one with CNPJ 12.123.123/1234.12 does not exist..."
extract_and_validate_cnpj(texto)
['00.000.000/0001-91', '00.360.305/0001-04']
Converting file from closed format to plain text and extracting metadata
from querido_diario_toolbox import Gazette
from querido_diario_toolbox.etl.text_extractor import create_text_extractor
config = {"apache_tika_jar": "caminho/apache/tika/jar/tika-app-1.24.1.jar"}
extrator = create_text_extractor(config)
diario = Gazette(filepath="caminho/diario/fechado/diario.pdf")
extrator.extract_text(diario)
extrator.extract_metadata(diario)
extrator.load_content(diario)
After the execution of extrator.load_content(diario)
, two files (a .txt
with pure text and a .json
with metadata) will be created.