OpenRefine offers lots of features to help you learn about your dataset, even if you don’t change a single character. In this section we cover different ways for sorting through, filtering, and viewing your data.
Unlike spreadsheets, OpenRefine doesn’t store formulas and display the output of those calculations; it only shows the value inside each cell. It doesn’t support cell colors or text formatting.
Each piece of information (each cell) in OpenRefine is assigned a data type. Some file formats, when imported, can set data types that are recognized by OpenRefine. Cells without an associated data type on import will be considered a “string” at first, but you can have OpenRefine convert cell contents into other data types later. This is set at the cell level, not at the column level.
You can see data types in action when you preview a new project: check the box next to Attempt to parse cell text into numbers, and cells will be converted to the “number” data type based on their contents. You’ll see numbers change from black text to green if they are recognized.
The data type will determine what you can do with the value. For example, if you want to add two values together, they must both be recognized as the number type.
You can check data types at any time by:
- clicking “edit” on a single cell (where you can also edit the type)
- creating a
type(value)into the Expression field. This will generate the data type in the preview, and you can facet by data type if you press . on a column, and inserting
The data types supported are:
- string (one or more text characters)
- number (one or more characters of numbers only)
- boolean (values of “true” or “false”)
- date (ISO-8601-compliant extended format with time in UTC: YYYY-MM-DDTHH:MM:SSZ)
OpenRefine recognizes two further data types as a result of its own processes:
An “error” data type is created when the cell is storing an error generated during a transformation in OpenRefine.
A “null” data type is a special type that means “this cell has no value.” It’s distinct from cells that have values such as “0” or “false”, or cells that look empty but have whitespace in them, or cells that contain empty strings. When you use
type(value), it will show you that the cell’s value is “null” and its type is “undefined.” You can opt to show “null” values, by going to → → .
Changing a cell's data type is not the same operation as transforming its contents. For example, using a column-wide transform such as→ → may not convert all values successfully, but going to an individual cell, clicking “edit”, and changing the data type can successfully convert text to a date. These operations use different underlying code. Learn more about date formatting and transformations in the next section.
Date-formatted data in OpenRefine relies on a number of conversion tools and standards. For something to be considered a date in OpenRefine, it will be converted into the ISO-8601-compliant extended format with time in UTC: YYYY-MM-DDTHH:MM:SSZ.
When you run→ → , the following column of strings on the left will transform into the values on the right:
|2012 02 16||→||2012-02-16T00:00:00Z|
|August 2nd 1964||→||1964-08-02T00:00:00Z|
OpenRefine uses a variety of tools to recognize, convert, and format dates and so some of the values above can be reformatted using other methods. In this case, clicking the “today” cell and editing its data type manually will convert “today” into a value such as “2020-08-14T00:00:00Z”. Attempting the same data-type change on “never” will give you an error message and refuse to proceed.
You can do more precise conversion and formatting using expressions and arguments based on the state of your data: see the GREL functions reference section on Date functions for more help.
You can convert dates into a more human-readable format when you export your data using the custom tabular exporter. You are given the option to keep your dates in the ISO 8601 format, to output short, medium, long, or full locale formats, or to specify a custom format. This means that you can format your dates into, for example, MM/DD/YY (the US short standard) with or without including the time, after working with ISO-8601-formatted dates in your project.
The following table shows some example date and time formatting styles for the U.S. and French locales:
|Style||U.S. Locale||French Locale|
|Default||Jun 30, 2009 7:03:47 AM||30 juin 2009 07:03:47|
|Short||6/30/09 7:03 AM||30/06/09 07:03|
|Medium||Jun 30, 2009 7:03:47 AM||30 juin 2009 07:03:47|
|Long||June 30, 2009 7:03:47 AM PDT||30 juin 2009 07:03:47 PDT|
|Full||Tuesday, June 30, 2009 7:03:47 AM PDT||mardi 30 juin 2009 07 h 03 PDT|
Rows vs. records
A row is a simple way to organize data: a series of cells, one cell per column. Sometimes there are multiple pieces of information in one cell, such as when a survey respondent can select more than one response.
In cases where there is more than one value for a single column in one or more rows, you may wish to use OpenRefine’s records mode: this defines a single record as potentially containing more than one row. From there you can transform cells into multiple rows, each cell containing one value you’d like to work with.
Generally, when you import some data, OpenRefine reads that data in row mode. From the project screen, you can convert the project into records mode. OpenRefine remembers this action and will present you with records mode each time you open the project from then on.
OpenRefine understands records based on the content of the first column, what we call the “key column.” Splitting a row into a multi-row record will base all association on the first column in your dataset.
If you have more than one column to split out into multiple rows, OpenRefine will keep your data associated with its original record, and associate subgroups based on the top-most row in each group.
You can imagine the structure as a tree with many branches, all leading back to the same trunk.
For example, your key column may be a film or television show, with multiple cast members identified by name, associated to that work. You may have one or more roles listed for each person. The roles are linked to the actors, which are linked to the title.
|The Wizard of Oz||Judy Garland||Dorothy Gale|
|The Tin Man|
|The Cowardly Lion|
|Frank Morgan||Professor Marvel|
|The Carriage Driver|
|The Wizard of Oz|
|Margaret Hamilton||Miss Almira Gulch|
|The Wicked Witch of the West|
Once you are in records mode, you can still move some columns around, but if you move a column to the beginning, you may find your data becomes misaligned. The new key column will sort into records based on empty cells, and values in the old key column will be assigned to the last row in the old record (the key value sitting above those values).
OpenRefine assigns a unique key behind the scenes, so your records don’t need a unique identifier in the key column. You can keep track of which rows are assigned to each record by the record number that appears under thecolumn.
Be careful when in records mode that you do not accidentally delete rows based on being blank in one column where there is a value in another.