Handling schema changes and evolution in Hadoop

In Hadoop, if you use Hive and try to have different schemas for different partitions, you cannot have a field inserted in the middle of the schema.

If fields are only ever added at the end, Hive handles this natively.

However, things break if a field is inserted in the middle, because Hive's built-in SerDes for text and sequence files map columns to data by position, not by name.

There are a few ways to handle schema evolution and changes in Hadoop:

Use Avro

For the flat schema of a database table (or file), generate an Avro schema. This Avro schema can then be used anywhere in your code, or mapped to Hive using the AvroSerde:

https://cwiki.apache.org/Hive/avroserde-working-with-avro-from-hive.html
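As a rough sketch of why Avro fixes the "field in the middle" problem (the Customer record and its fields are invented for illustration): Avro resolves fields by name rather than by position, so a field inserted anywhere in the schema is fine as long as it carries a default, and old data stays readable.

```java
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class AvroEvolutionDemo {
    public static void main(String[] args) throws Exception {
        // v1: the original flat schema of the table
        Schema v1 = SchemaBuilder.record("Customer").fields()
                .requiredInt("id")
                .requiredString("name")
                .endRecord();

        // v2: "email" inserted in the middle -- fine in Avro because it has a default
        Schema v2 = SchemaBuilder.record("Customer").fields()
                .requiredInt("id")
                .name("email").type().stringType().stringDefault("unknown")
                .requiredString("name")
                .endRecord();

        // Write a record with the old (v1) schema
        GenericRecord rec = new GenericData.Record(v1);
        rec.put("id", 1);
        rec.put("name", "alice");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(v1).write(rec, enc);
        enc.flush();

        // Read it back with the new (v2) schema: fields are matched by name,
        // and the missing "email" is filled in from its default
        BinaryDecoder dec = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord evolved =
                new GenericDatumReader<GenericRecord>(v1, v2).read(null, dec);
        System.out.println(evolved); // {"id": 1, "email": "unknown", "name": "alice"}
    }
}
```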

I am exploring various JSON APIs that can be used for this, as well as other approaches:

http://www.infoq.com/articles/AVROSchemaJAXB

Nokia has released code to generate Avro schemas from XML:

https://github.com/Nokia/Avro-Schema-Generator

Okay, my problem statement and solution are simple.

The ideas in my mind are as follows (a code sketch follows the list):

  1. Store the schema details of each table in some database
  2. Read the field details from the database and generate an Avro schema
  3. Store it at some location in HDFS, e.g. /schema/tableschema
  4. Map Hive to use this Avro schema location in HDFS
  5. If some change comes to the schema, update the database and the system would regenerate the Avro schema
  6. Push the new schema to HDFS
  7. Hive would use the new schema without breaking old data

This should be able to support schema changes and evolution for data in Hadoop.

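Here is a minimal sketch of steps 1 to 6, assuming a JDBC source; the /schema/&lt;table&gt; layout and the crude type mapping are illustrative choices, not fixed decisions.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Types;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SchemaPublisher {

    /** Steps 1-2: read the table's field details and generate an Avro schema. */
    static Schema fromJdbc(Connection conn, String table) throws Exception {
        SchemaBuilder.FieldAssembler<Schema> fields =
                SchemaBuilder.record(table).namespace("example.hadoop").fields();
        ResultSet cols = conn.getMetaData().getColumns(null, null, table, null);
        while (cols.next()) {
            String name = cols.getString("COLUMN_NAME");
            // Make every field optional (union with null) so that old files
            // missing a newly added column still read against the new schema.
            switch (cols.getInt("DATA_TYPE")) {
                case Types.INTEGER: fields = fields.optionalInt(name);    break;
                case Types.BIGINT:  fields = fields.optionalLong(name);   break;
                case Types.DOUBLE:  fields = fields.optionalDouble(name); break;
                default:            fields = fields.optionalString(name); break;
            }
        }
        return fields.endRecord();
    }

    /** Steps 3 and 6: push the (new) schema to /schema/<table> in HDFS. */
    static void publish(Schema schema, String table) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/schema/" + table + "/" + table + ".avsc");
        try (FSDataOutputStream out = fs.create(path, true)) { // overwrite old version
            out.writeBytes(schema.toString(true));             // pretty-printed JSON
        }
    }
}
```

For steps 4 and 7, the Hive table would be created with the AvroSerde and its avro.schema.url table property pointed at this HDFS path, so overwriting the .avsc file is all it takes for Hive to start using the new schema.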
Most NoSQL databases take a similar approach; check the Oracle link below:

http://docs.oracle.com/cd/NOSQL/html/GettingStartedGuide/avroschemas.html

The Oracle NoSQL solution manages the schema information and changes in the KVStore:

http://docs.oracle.com/cd/NOSQL/html/GettingStartedGuide/provideschema.html


Use ORC

https://github.com/hortonworks/orc

The Hortonworks folks are working on a new file format which, like Avro, stores the schema within the data:

http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds_Hive/orcfile.html

The "versioned metadata" means that the ORC file's metadata is stored in Protocol Buffers, so that we can add (or remove) fields in the metadata. That means that for some changes to the ORC file format, we can provide both forward and backward compatibility.

ORC files, like Avro files, are self-describing: they include the type structure of the records in the metadata of the file. It will take more integration work with Hive to make schemas very flexible with ORC.
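As a small illustration of the self-describing part (using the reader API of the newer standalone Apache ORC library; the file path is hypothetical), the type structure can be pulled straight out of the file's own footer:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.TypeDescription;

public class OrcSchemaPeek {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Open an existing ORC file; the path is made up for illustration.
        Reader reader = OrcFile.createReader(
                new Path("/data/events/part-00000.orc"),
                OrcFile.readerOptions(conf));
        // The type structure of the records lives in the file's own metadata.
        TypeDescription schema = reader.getSchema();
        System.out.println(schema); // e.g. struct<id:int,name:string>
    }
}
```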


Please share your views and comments below.

Thank you.