Character Limits When Loading Wide Fixed-Width Files into Hive Using SerDe*

A SerDe (Serializer/Deserializer) provides the functionality to read fixed-width files into Hive, and it is fairly simple to use once the syntax is mastered. One simply sets the width of each field, via a regular expression with one capture group per column, and each record is split into columns of the corresponding widths. However, SerDe properties run into trouble with very wide files, 1,000 columns for instance. The reason can be found in this JIRA ticket: https://issues.apache.org/jira/browse/HIVE-1364
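As a minimal sketch (the table name, column layout, and path here are hypothetical), a fixed-width file can be mapped with the built-in RegexSerDe, where each `(.{n})` capture group carves out one column:

```sql
-- Hypothetical layout: 10-char name, 2-char state, 5-char zip.
-- Each (.{n}) capture group becomes one column, in order.
-- Note: the built-in RegexSerDe requires all columns to be STRING.
CREATE EXTERNAL TABLE customers (
  name  STRING,
  state STRING,
  zip   STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "(.{10})(.{2})(.{5})"
)
LOCATION '/data/customers';
```

At roughly seven characters of regex per column, a 1,000-column file needs a pattern several thousand characters long, which is exactly where the limit described below bites.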

Because of limitations in earlier versions of the databases that can back the Hive metastore (e.g. MySQL, Oracle), the original character limit for a SerDe property value was 767. The ticket above raised that limit to 4,000 characters. In my recent experience, however, this was still not enough: the SerDe regular expression is truncated at 4,000 characters, so if your column definitions span 5,000 characters, everything after 4,000 is silently dropped. The solution is simple (if you have admin privileges): alter the datatype in the Hive metastore that stores the SerDe regex so it can hold more than 4,000 characters. Widening the column does not affect the data already in the metastore table, and every supported metastore database has a datatype with capacity beyond 4,000. For my use case, 8,000 characters of regular expression was sufficient to delimit the fixed-width files.
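As an illustration, assuming a MySQL-backed metastore using the standard schema, where SerDe properties (including input.regex) live in the PARAM_VALUE column of the SERDE_PARAMS table, the change might look like the following. The database name and the 8,000-character size are assumptions from my own setup; back up the metastore before altering it.

```sql
-- Run against the metastore database itself, not through Hive.
USE metastore;  -- hypothetical metastore database name

-- Widen the column that holds SERDEPROPERTIES values
-- from VARCHAR(4000) to VARCHAR(8000).
ALTER TABLE SERDE_PARAMS MODIFY PARAM_VALUE VARCHAR(8000);
```

After the change, dropping and recreating the wide table stores the full regex rather than a truncated one.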

*The purpose of this blog post is to point people toward the solution to a nuanced problem that I had some difficulty tracking down. This is not meant to be a tutorial; in my case the remedy was simple once I figured out the problem. If you are one of the relatively small group of people who happen to run into this issue, I hope your Google Juju brought you here quickly.