Skip to main content

Update: Google News Initiative and technical road map

· 6 min read
Martin Magdinier

The article present progress made following the funding from Google News Initiative.

Recent Progress

OpenRefine 2.8 and 3.0

OpenRefine 3.0 Beta was released on May 27. This is the second release with the support of Google News Initiative (see more details below) and includes several major enhancements including:

  • Wikidata extension to reconcile, enrich and push data from/to Wikidata. See the video tutorial and Wikidata documentation (2.8 and 3.0);
  • Support of Data package metadata standards to describe and package a collection of data (3.0)
  • Improved metadata and tag system to ease project identification from OpenRefine homepage (2.8 and 3.0);
  • Fix the connection with Google Drive to import (3.0);
  • Support of HTTP Header when sending a request (3.0);
  • Create project from a database connection and generate INSERT INTO statement (3.0);
  • Support split multivalued-cells by regex/special characters (2.8);
  • Several improvements of the reconciliation API (2.8) ;
  • Documentation improvement on our wiki;

OpenRefine 3.0 is under beta, and we still have few bugs. If you want to use it please backup your workspace directory before installing and report any problems that you encounter.

Google News Initiative helped to design a new logo for OpenRefine. You may have noticed the new logo on our GitHub project. We made it part of the 3.0 release, and we will update it on the website when the deployment of the new website layout occurs in the coming weeks.

Translations

OpenRefine 3.0 is fully translated into five languages and available in seven more with over 87% coverage! Thanks a lot to all the community members who help to localize OpenRefine in their language! You can help complete existing translations or start a new language via Weblate. See our Translate OpenRefine wiki page for more details.

Translation available:

  • English 100.0% translated
  • Italian 100.0% translated
  • Japanese 100.0% translated
  • French 100.0% translated
  • German 100.0% translated
  • Portuguese (Brazil) 98.1% translated
  • Filipino 97.3% translated
  • Cebuano 97.5% % translated
  • Tagalog 97.2% translated
  • Hungarian 95.3% translated
  • Russian 93.4% translated
  • Chinese 90.0% translated
  • Spanish 89.3% translated
  • Hebrew 87.7% translated
  • Romanian 6.7% translated

Google News Initiative Support

In December 2017, Google News Initiative offered USD 100,000 to support the development of OpenRefine as described in the document Phase 1 and Phase 2+3. So far the funds have been used to finance the release of OpenRefine 2.8 and 3.0 (over 150 issues closed!) and covers expenses to present OpenRefine at the NICAR 18 and IRE 18 conferences.

Governance

The OpenRefine core team is currently managing the funds via a private US company to get off the ground quickly. We know that the private US corporation is not a viable solution in the long term and we are currently under discussion to join Software Freedom Conservancy.

As we begin to use Software Freedom Conservancy as the parent organization of OpenRefine Foundation, we will clarify and formalize our governance model. In the meanwhile, we welcome anyone feedbacks and comments via our GitHub project or our user and developer mailing lists. If you would like to request financing for the development of a feature, the attendance or the organization of an event, please reach out via one of our mailing lists.

You can review our usage of the funds in the following Finance document.

OpenRefine roadmap

With the remaining funds available from Google News Initiative, we plan to address the tight coupling between the front and back end and set up the OpenRefine Foundation for the next phases. The current implementation presents technical barriers for new contributors to engage with the community and limits the attractiveness of OpenRefine for developers.

Currently, all OpenRefine contributors are working on the project on top of their other academics, professional lives, and family commitments. While the Google funds help the core team to free some extra time to support OpenRefine, part-time involvement is not enough to address coming milestones (after all, we're a very small team!). We plan to hire fellow(s) / contractor(s) to work full time for a couple of months with the community and take the lead on the three following items (we will share more information regarding the fellow/contractor positions).

You can track the next phase in the Github project separation of Front End and Back End.

Prepare Front / Back-end separation

Our goal is to have a full API exposing all OpenRefine operation and commands. This first phase focuses solely on documenting OpenRefine operations and commands available and designing the different API endpoints. The idea is to start small and grow the API over time by allowing for extensibility. In the process, we will take into account existing extensions developed for OpenRefine and support backward compatibility as much as we can.

Datagrid Enhancements

A faster grid technology with virtual DOM will set the foundation for faster scrolling, and interactive elements like drag and drop column moving and resizing, and column and row grouping to better support record mode (see the dedicated GitHub project for UI Improvements. During the implementation, we can remove the custom bits of the frontend that are coupled directly to the back-end. The new datagrid implementation is tracked via #1347.

Data Storage Enhancements

The group is looking at using Apache Arrow to improve the in-memory data model. An alternative, higher performance in-memory / on-disk data storage technology will help increase OpenRefine’s data processing capacity. The implementation of Apache Arrow is tracked in #1469 and can be done independently from the API and datagrid implementation.

Following Phases

Following the completion of the three previous phases, we want to:

  • Implement the new API;
  • Leverage the new API and datagrid to move away from Jquery-based components and using React or Angular;
  • Ensure that we have extendable API points for our most important components;
  • Thanks to the decoupling of the front and back, we can answer the numerous requests from the community to support new data processing engines like Apache Spark or Google DataFlow / DataProc. Those items are tracked in the project Performance Improvements;
  • Keep supporting new features from requested from the community;