Sunday, April 8, 2018

So long, Blogger

Well, Horothesia has had a good decade+ here at Blogger, but I feel it's time for a change. I've copied the entire archive to my new site and blog at paregorios.org and it's there that I'll post any new blog content.

Friday, February 2, 2018

Preserving Accented and Non-Roman Characters in CSV Workflows

Digital work in and around the Humanities often involves moving data from one system or format to another. That data often involves complex textual materials in multiple languages and writing systems. One commonly used format is the "Comma-Separated Values" text file. It's not uncommon to find that characters not used in English get garbled when exported from a spreadsheet program like Microsoft Excel to CSV (or imported from CSV into such a program). What's going on and how do you make it stop?

Why

CSV began life in an era before Unicode and, because of that background, some software (e.g., some older versions of Excel) assumes that CSV should be encoded using the ASCII text encoding scheme. Some software (e.g., more recent versions of Excel) defaults to using ASCII, but lets you override that manually. Some software tries to guess what encoding to use when reading or writing a given CSV file, but its guesses are not foolproof. Some software (e.g., Excel for Mac 2016) writes a special code called a Byte-Order Mark (BOM) into the beginning of any CSV file that uses a Unicode-aware encoding. Some software doesn't expect a BOM and will fail to read the data correctly even if the encoding (e.g., UTF-8) is otherwise supported.
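To make the BOM issue concrete, here is a minimal Python sketch. It is only an illustration: the function name has_utf8_bom and the sample data are my own inventions, not part of any particular tool. It checks whether a chunk of CSV data starts with the UTF-8 BOM and shows what happens when a reader that doesn't expect the BOM decodes it anyway.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the raw bytes start with the UTF-8 Byte-Order Mark."""
    return data.startswith(codecs.BOM_UTF8)


# Hypothetical sample data: the same tiny CSV, written with and without a BOM.
# Python's "utf-8-sig" codec prepends the BOM when encoding.
sample_text = "name\nZoë\n"
with_bom = sample_text.encode("utf-8-sig")
without_bom = sample_text.encode("utf-8")

# A reader that decodes as plain UTF-8 keeps the BOM as an invisible
# character (U+FEFF) glued to the front of the first column header.
naive_header = with_bom.decode("utf-8").splitlines()[0]

# Decoding with "utf-8-sig" strips the BOM if one is present.
clean_header = with_bom.decode("utf-8-sig").splitlines()[0]

print(has_utf8_bom(with_bom), has_utf8_bom(without_bom))
print(repr(naive_header), repr(clean_header))
```

The stray U+FEFF on the first header is a typical symptom: the file looks fine on screen, but a lookup for a column called "name" fails.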

How to make it stop

The best way to make it stop is to:
  1. Make sure that any CSV file you import or export is encoded in UTF-8 without a Byte-Order Mark.
  2. Make sure that any software you're using is capable of reading and writing CSV files in UTF-8 without BOM and has been told to do so.
Failing that (i.e., you got the CSV file from someone else), you need to find out what encoding it uses and configure your software to read it properly. But note that if the creator of the CSV file allowed it to be written in ASCII, any characters that ASCII can't represent were lost or replaced at export, and the file can't be repaired. You'll have to get them to re-export properly, or to send you the original file so you can open it in appropriate software and save it more deliberately.
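If you do know (or have worked out) the encoding of a CSV file you received, re-encoding it is straightforward. The following is a rough Python sketch, not a complete tool: the function name to_utf8_no_bom is hypothetical, and Windows-1252 ("cp1252") is used purely as an example of a source encoding you might have identified. The last few lines illustrate why an ASCII export can't be rescued.

```python
def to_utf8_no_bom(raw: bytes, source_encoding: str) -> bytes:
    """Decode CSV bytes from a known encoding and re-encode as UTF-8 without a BOM."""
    text = raw.decode(source_encoding)
    # Drop a leading Byte-Order Mark if the decoder left one in place.
    if text.startswith("\ufeff"):
        text = text[1:]
    return text.encode("utf-8")


# Hypothetical sample: a CSV that someone exported as Windows-1252.
received = "name\nZoë\n".encode("cp1252")
fixed = to_utf8_no_bom(received, "cp1252")
print(fixed.decode("utf-8"))

# Why an ASCII export can't be repaired: the non-ASCII character is
# replaced at export time, and nothing in the file records what it was.
lossy = "Zoë".encode("ascii", errors="replace")
print(lossy)
```

To use this on real files, read the source in binary mode, pass the bytes through the function, and write the result back out in binary mode so that nothing re-encodes it behind your back.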