How to search and analyse big leaks like the Panama Papers with free software and open data

Want to search and analyze Panama Papers data with open source research tools and open data, without posting sensitive documents and search queries to cloud services?

Free software for your own semantic search engine based on Open Source enterprise search

Download the Open Semantic Desktop Search virtual machine.

Hint: Since the virtual machine appliance file is 2 GB in size, you should wait for the new Open Semantic Search release next month, which will bring some new features and improved user interfaces.

First extract and index the text from data or document files:

Original documents (unstructured data)

If you have original documents, configure their folder(s) to be indexed (add each one as a shared folder of the search engine virtual machine, as described in the configuration documentation).

Open Semantic Search will extract text from digital documents with different extractors, and from scanned documents by Optical Character Recognition (OCR), and will index it after the Virtual Machine (VM) starts.

Aggregated and filtered data (structured data)

If you have downloaded the Panama Papers CSV tables from ICIJ:

Copy the CSV files (Addresses.csv, Entities.csv and so on) to a folder that is configured as an indexed folder.

After the virtual machine starts, Open Semantic Search will automatically extract and index the rows and fields of the CSV tables for you.
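To illustrate what indexing "rows and fields" means, here is a minimal Python sketch that turns each CSV row into one document keyed by column name. It is not the Open Semantic Search implementation; the function name, the id scheme and the sample data are invented for illustration.

```python
import csv
import io


def csv_rows_to_documents(csv_text, source_name):
    """Turn each CSV row into one dict ("document") keyed by column name."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row_number, row in enumerate(reader, start=1):
        # Hypothetical id scheme: file name plus row number.
        doc = {"id": f"{source_name}#{row_number}"}
        # Keep only named, non-empty fields so empty cells do not clutter the index.
        doc.update({field: value for field, value in row.items() if field and value})
        documents.append(doc)
    return documents


# Made-up sample data; the real ICIJ files have different columns and content.
sample = "name,country\nExample Holdings Ltd,DEU\nSample Trading Inc,\n"
docs = csv_rows_to_documents(sample, "Entities.csv")
print(docs)
```

Each resulting document can then be searched by any of its fields, which is what makes a structured table behave like a set of small documents in a search index.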

If the CSV files are still inside the ZIP archive, this does not happen automatically. This is a bug (too much text data inside the ZIP for extraction in one step) which will be fixed within the next weeks.
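Until then, the CSV files have to be unpacked before indexing. Any archive tool will do; as a sketch, the following Python uses the standard zipfile module to extract only the CSV files into a folder. The function name and the paths in the usage note are hypothetical.

```python
import zipfile
from pathlib import Path


def extract_csv_files(zip_path, target_folder):
    """Extract only the .csv members of a ZIP archive into target_folder."""
    target = Path(target_folder)
    target.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            # Skip anything that is not a CSV table (readme files, licences, ...).
            if member.lower().endswith(".csv"):
                archive.extract(member, target)
                extracted.append(member)
    return extracted
```

A call could look like `extract_csv_files("downloaded_tables.zip", "/path/to/indexed/folder")`, where both names are placeholders for your own download and your own configured indexed folder.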

Search

After indexing the CSV tables or documents you can open the search interface:

This user interface allows you to search the index with powerful search operators, for example for germany OR german* OR DEU OR Berlin OR Deutsch* OR "Steffan Mappus" or some other names.
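If you have many terms, typing such a query by hand gets tedious. A small helper like the following sketch can assemble it from a list; the function name is made up, and the only syntax assumed is what the example query above already shows (OR between terms, double quotes around phrases, * as wildcard).

```python
def build_or_query(terms):
    """Join search terms with OR, quoting multi-word terms as phrases."""
    parts = []
    for term in terms:
        term = term.strip()
        if not term:
            continue
        # A term containing a space is quoted so it is searched as one phrase.
        parts.append(f'"{term}"' if " " in term else term)
    return " OR ".join(parts)


query = build_or_query(
    ["germany", "german*", "DEU", "Berlin", "Deutsch*", "Steffan Mappus"]
)
print(query)
```

The printed string can be pasted into the search field of the user interface.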

Indexing and searching run on your own computer without using cloud services, so what you search for cannot be seen, spied on or stored by external cloud services.

Manage named entities like companies or politicians

You can add names (for example companies or politicians) to the Named Entities Manager.

Faceted search (Aggregated overview and interactive filters)

This enables an aggregated overview of these named entities, such as persons or organizations:

No, these entities are not an analysis of the Panama leaks; I was too lazy to make new screenshots. But the screenshot shows how you can use these overviews, or how to use named entities as interactive filters to narrow down search results.

So a click on a facet (e.g. an organization) will drill down the search results to fewer documents, namely those that also match this additional facet/filter.
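The idea behind facet counts and drill-down can be sketched in a few lines of Python. This is only an in-memory illustration with invented documents and field names, not how the search engine implements facets internally.

```python
from collections import Counter

# Hypothetical mini index: each document lists the named entities found in it.
documents = [
    {"id": 1, "organizations": ["Example Bank", "Sample Holdings"]},
    {"id": 2, "organizations": ["Example Bank"]},
    {"id": 3, "organizations": ["Other Corp"]},
]


def facet_counts(docs, field):
    """Count in how many documents each facet value occurs."""
    counts = Counter()
    for doc in docs:
        # set() makes sure a value is counted once per document.
        counts.update(set(doc.get(field, [])))
    return counts


def drill_down(docs, field, value):
    """Keep only the documents that carry the selected facet value."""
    return [doc for doc in docs if value in doc.get(field, [])]


print(facet_counts(documents, "organizations").most_common())
filtered = drill_down(documents, "organizations", "Example Bank")
print([doc["id"] for doc in filtered])
```

The counts correspond to the aggregated overview, and `drill_down` corresponds to the click on a facet that narrows the result list.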

How to get leads by using watchlists and Open Data

But you don't have to add each potential name yourself in order to add some structure or watchlists, such as names of important people or politicians.

Just import Open Data into the Lists and Ontologies Manager:

List of names of people of interest like politicians

Get a list of names of politicians from your country, for example from Wikidata, Wikipedia's structured database:

Add this list with the Lists and Ontologies Manager. This gives you another search facet and an overview of the "Persons" occurring in the documents.
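As a rough sketch of how such a list could be fetched, the following Python builds a SPARQL query for the public Wikidata query service. P106 (occupation), Q82955 (politician), P27 (country of citizenship) and Q183 (Germany) are Wikidata identifiers; the function names, the result limit and the User-Agent string are assumptions made for illustration. People who are only tagged with a more specific occupation will be missed by this simple query, and importing the names into the Lists and Ontologies Manager is not shown.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"


def build_politicians_query(country_qid, language="en", limit=1000):
    """Build a SPARQL query for labels of politicians who are citizens of a country."""
    return f"""
SELECT DISTINCT ?name WHERE {{
  ?person wdt:P106 wd:Q82955 ;        # occupation: politician
          wdt:P27 wd:{country_qid} ;  # country of citizenship
          rdfs:label ?name .
  FILTER(LANG(?name) = "{language}")
}}
LIMIT {limit}
"""


def parse_names(sparql_json):
    """Extract the unique names from a SPARQL JSON result, sorted alphabetically."""
    bindings = sparql_json["results"]["bindings"]
    return sorted({binding["name"]["value"] for binding in bindings})


def fetch_names(country_qid):
    """Send the query to the Wikidata endpoint (needs network access)."""
    params = urllib.parse.urlencode(
        {"query": build_politicians_query(country_qid), "format": "json"}
    )
    request = urllib.request.Request(
        ENDPOINT + "?" + params,
        # Illustrative User-Agent; replace it with one that identifies you.
        headers={"User-Agent": "watchlist-sketch/0.1 (example)"},
    )
    with urllib.request.urlopen(request) as response:
        return parse_names(json.load(response))
```

Calling `fetch_names("Q183")` would request names for Germany; swap in the Wikidata ID of your own country.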

Open Data list of location names

Another option is to filter by town names of your country:

Get a list of town names of your country, for example from GeoNames, from OpenStreetMap or from Wikidata.

Add this list with the Lists and Ontologies Manager. This gives you another search facet and an overview of the "Locations" in your country that occur in the documents.

So you can filter for towns in your country and see which people occur in those documents. They may still be unknown and not in your lists of persons, but very interesting for the searched context (e.g. your country).
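How a plain list of names turns into a filter can be illustrated with a simple matcher. The sketch below tags each document with the list entries it mentions and then keeps only the documents with at least one hit. The town list, the documents and the function name are invented, and the real list matching in Open Semantic Search may work differently.

```python
import re


def tag_with_list(text, names):
    """Return the list entries that occur in the text as whole words (case-insensitive)."""
    found = []
    for name in names:
        pattern = r"\b" + re.escape(name) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(name)
    return found


# Hypothetical town list and documents, invented for illustration only.
towns = ["Berlin", "Hamburg"]
documents = {
    "doc1": "Invoice sent to an office in Berlin.",
    "doc2": "Shareholder meeting held in Hamburg and Berlin.",
    "doc3": "No known location mentioned here.",
}

# The "Locations" facet: which towns from the list occur in which document.
locations = {doc_id: tag_with_list(text, towns) for doc_id, text in documents.items()}
# The filter: only documents that mention at least one town of the list.
in_country = [doc_id for doc_id, hits in locations.items() if hits]
print(locations)
print(in_country)
```

Reading the remaining documents (here doc1 and doc2) is then the step where new, not yet listed names can turn up.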

Technical hint: You will get better performance if you add structure like named entities before indexing.

Fuzzy search

No document scan, no optical character recognition (OCR) and no person filling out a form is perfect.

To find at least some of the poor-quality data despite bad OCR or typos, use fuzzy search.

You don't have to run a fuzzy search for each name separately. You can run it for a whole list with our Fuzzy search by lists tool.
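To show the principle of fuzzy matching against a list, here is a small sketch based on the similarity ratio from Python's standard difflib module. This is a stand-in for illustration and not the method the Fuzzy search by lists tool uses; the watchlist, the sample text with simulated OCR errors and the threshold of 0.8 are assumptions.

```python
import difflib


def fuzzy_matches(name, text, threshold=0.8):
    """Return word sequences in text whose similarity to name reaches the threshold."""
    words = text.split()
    size = len(name.split())
    matches = []
    # Slide a window with as many words as the name over the text.
    for start in range(len(words) - size + 1):
        candidate = " ".join(words[start:start + size]).strip(".,;:")
        ratio = difflib.SequenceMatcher(None, name.lower(), candidate.lower()).ratio()
        if ratio >= threshold:
            matches.append(candidate)
    return matches


# Invented watchlist and text; "rn" for "m" and "l" for "i" imitate typical OCR errors.
watchlist = ["Example Holdings", "Berlin"]
ocr_text = "Payment from Exarnple Holdings to an account in Berlln."

for name in watchlist:
    print(name, "->", fuzzy_matches(name, ocr_text))
```

A lower threshold finds more damaged spellings but also more false hits, so the value is worth tuning on your own data.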