Name: finnish-parliament-visitors
Owner: Open Knowledge Finland
Description: Finnish Parliament visitor register
Created: 2017-06-14 07:41:05.0
Updated: 2018-04-20 11:45:54.0
Pushed: 2018-04-19 17:55:45.0
Homepage: null
Size: 9957
Language: Shell
Finnish Parliament Visitor Logs
This repo describes the process of digitising the FOI responses from the Finnish Parliament visitor register. The requests are currently answered with printed copies of the original spreadsheets.
To turn them into a usable digital format, the printouts are photographed periodically, and image processing, OCR, and PDF table extraction (Tabula) are applied to produce CSVs for further processing. The scripts for this are provided below.
In short
(1) ImageMagick (for cleaning)
(2) Tesseract (for OCR to searchable PDFs)
(3) Tabula (for PDF to CSV)
Name the folders as described below and give as input the directory containing all the folders you want to process. To repeat the process on only the files that went wrong, copy them into a temporary 'fixing' folder and run the 2 scripts on that folder.
Run : bash 1.processImages.sh
Depends On: ImageMagick -> https://www.imagemagick.org
Exports : A folder called 'processed_photos' in each of the folders above, in which the images found are colour balanced and rotated if they are portraits.
Replies : Each subfolder has a 'processed_photos' folder with the results of the processing.
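As a rough illustration of this step, here is a minimal sketch of how the cleaning could be done with ImageMagick. It is not the content of 1.processImages.sh. It assumes the classic `convert` command is on the PATH, uses `-auto-level` as a stand-in for the colour balancing, and relies on `-rotate '90<'`, which rotates only images whose width is smaller than their height. The function names `processed_path`, `process_photo` and `process_all` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: the real 1.processImages.sh may use different ImageMagick options.

# Hypothetical helper: path of the processed copy of a photo.
processed_path() {
  local photo="$1"
  printf '%s/processed_photos/%s\n' "$(dirname "$photo")" "$(basename "$photo")"
}

# Hypothetical helper: clean one photo with ImageMagick.
process_photo() {
  local photo="$1" out
  out="$(processed_path "$photo")"
  mkdir -p "$(dirname "$out")"
  # -auto-orient applies the camera's EXIF orientation (assumption: phone photos).
  # -auto-level stretches the colour channels (assumed stand-in for colour balancing).
  # -rotate '90<' rotates by 90 degrees only when width < height (portrait).
  convert "$photo" -auto-orient -auto-level -rotate '90<' "$out"
}

# Hypothetical driver: process every .jpg directly inside each subfolder of the main folder.
process_all() {
  local main="$1" dir photo
  for dir in "$main"/*/; do
    for photo in "$dir"*.jpg; do
      [ -e "$photo" ] || continue   # skip subfolders without photos
      process_photo "$photo"
    done
  done
}
```

Calling `process_all /full/path/to/main-folder` would then leave a 'processed_photos' folder inside each subfolder. Note that the glob above only matches lower-case `.jpg` names.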
Run : bash 2.ocrImages.sh
Depends On :
Tabula-Java: a working .jar version is included, but make sure you have Java 1.7+. https://github.com/tabulapdf/tabula-java/releases
Tesseract (3.05+) with FIN, SWE language packages
In short brew install tesseract
and brew install tesseract-<langcode>
https://github.com/tesseract-ocr/tesseract/
Asks for : the main folder of folders (full path); give the same one as before!
Replies :
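The two tools in this step can be chained roughly as follows. This is a hedged sketch, not the content of 2.ocrImages.sh: the jar location (`TABULA_JAR`), the output directory argument and the function names are assumptions. The Tesseract call produces a searchable PDF with the Finnish and Swedish language data, and the Tabula-Java call extracts the tables of all pages to CSV.

```shell
#!/usr/bin/env bash
# Sketch only: the real 2.ocrImages.sh may differ.

# Assumption: path to the Tabula-Java jar; point this at the jar shipped with the repo.
TABULA_JAR="${TABULA_JAR:-tabula.jar}"

# Hypothetical helper: output base path for a photo. The image extension is kept,
# so IMG_1.jpg becomes IMG_1.jpg.pdf and IMG_1.jpg.csv.
ocr_base() {
  local photo="$1" outdir="$2"
  printf '%s/%s\n' "$outdir" "$(basename "$photo")"
}

# Hypothetical helper: OCR one processed photo, then extract its tables.
ocr_photo() {
  local photo="$1" outdir="$2" base
  base="$(ocr_base "$photo" "$outdir")"
  mkdir -p "$outdir"
  # Tesseract appends ".pdf" to the output base; fin+swe loads both language packs.
  tesseract "$photo" "$base" -l fin+swe pdf
  # Tabula-Java: all pages (-p all), CSV format (-f CSV), written to the given file (-o).
  java -jar "$TABULA_JAR" -p all -f CSV -o "$base.csv" "$base.pdf"
}
```

Keeping the image extension in the output base is a choice made here so that the resulting file names end in `.jpg.csv`.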
Run : bash 3.collate_csv.sh
Depends On :
Asks for : the main folder of folders (full path); give the same one as before!
Replies :
Run : bash 4.test_all.sh
Depends On :
Asks for : the main folder of folders (full path); give the same one as before!
Replies :
Run : bash 5.post_ocr_all.sh
Depends On :
Asks for : the main folder of folders (full path); give the same one as before!
Replies :
(https://github.com/FourCoffees) (https://github.com/AleksiKnuutila)
The above is optimised for the following data structure:
For each date, there are two sets of pages. The first pages list the people whose arrival the reception had been notified of ("Ilmoittautuneet"). The latter pages, starting again from the morning hours, list the people whose arrival had not been notified ("Ei ilmoittautuneet"). The second set of pages does not list the date. Hence it's important to maintain the order of the photographs, and this is why the metadata is encoded in the folder names.
There are two separate registers for different buildings. These are filed in different folders, the green and the blue folder. The blue folder is for the temporary use of the Sibelius Academy building, and the green folder is for the parliament itself.
The administration removes certain information from the register before making it viewable. The only description they have given of what is removed is an example: the people who visit the Ombudsman (Oikeusasiamies).
These 2 scripts run through all folders whose names have the following structure:
BuildingRegister#date#Notified
BuildingRegister :
There are two separate registers for different buildings
(g or b) The blue folder is for the temporary use of the Sibelius Academy (b), the green folder is for the parliament itself (g)
Date:
Date of the register page (YYYY-MM-DD). This is hard to retrieve from individual pages and therefore has to be entered in the folder name as extra metadata.
Notified :
Register of people whose arrival had been notified, "Ilmoittautuneet" (e), and register of people whose arrival had not been notified, "Ei ilmoittautuneet" (ne).
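To make the naming convention concrete, the sketch below splits a folder name of the form BuildingRegister#date#Notified into its three parts and rejects names that do not fit. The function `parse_folder` is hypothetical and is not taken from the repo's scripts; it only illustrates the structure described above.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical parser for folder names like g#2017-05-26#e.

parse_folder() {
  local name building rest date notified
  name="$(basename "$1")"
  building="${name%%#*}"   # text before the first '#'
  rest="${name#*#}"        # text after the first '#'
  date="${rest%%#*}"       # text between the first and second '#'
  notified="${rest#*#}"    # text after the second '#'

  case "$building" in
    g|b) ;;
    *) echo "unexpected building register: $building" >&2; return 1 ;;
  esac
  case "$date" in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) ;;
    *) echo "unexpected date: $date" >&2; return 1 ;;
  esac
  case "$notified" in
    e|ne) ;;
    *) echo "unexpected notified flag: $notified" >&2; return 1 ;;
  esac

  printf '%s %s %s\n' "$building" "$date" "$notified"
}
```

For instance, `parse_folder "/data/g#2017-05-26#e"` prints the building, date and notified flag separated by spaces, while a folder such as `x#2017-05-26#e` is rejected with a non-zero exit status.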
The final output is a folder named CSV containing all the exported csvs. This will be created at the root of the folders. Each csv is named:
Date#BuildingRegister#Notified#photo-id.csv
2017-05-26#g#e#IMG_20170602_120608.jpg.csv
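The mapping from folder name and photo to the final CSV name can be sketched as below. The helpers `csv_name` and `collect_csv` are hypothetical and only illustrate the naming scheme; they assume the folder name follows BuildingRegister#date#Notified and that the extracted CSV for a photo already exists.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helpers, not taken from the repo's scripts.

# Build Date#BuildingRegister#Notified#photo-id.csv from a folder name
# of the form BuildingRegister#date#Notified and a photo file name.
csv_name() {
  local folder photo building rest date notified
  folder="$(basename "$1")"
  photo="$(basename "$2")"
  building="${folder%%#*}"
  rest="${folder#*#}"
  date="${rest%%#*}"
  notified="${rest#*#}"
  printf '%s#%s#%s#%s.csv\n' "$date" "$building" "$notified" "$photo"
}

# Copy one extracted CSV into <main folder>/CSV under its final name.
collect_csv() {
  local main="$1" folder="$2" photo="$3" csv="$4"
  mkdir -p "$main/CSV"
  cp "$csv" "$main/CSV/$(csv_name "$folder" "$photo")"
}
```

With the folder `g#2017-05-26#e` and the photo `IMG_20170602_120608.jpg`, `csv_name` yields the example name shown above.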