In Hadoop, if you use Hive and try to have different schemas for different partitions, you cannot have a field inserted in the middle of the schema.
If fields are only added at the end, Hive handles the change natively.
However, things break if a field is inserted in the middle, because Hive maps columns to the underlying data positionally.
There are a few ways to handle schema evolution and schema changes in Hadoop.
Use Avro
For the flat schema of a database table (or file), generate an Avro schema. This Avro schema can then be used anywhere in your code, or mapped to Hive using the AvroSerDe:
https://cwiki.apache.org/Hive/avroserde-working-with-avro-from-hive.html
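As an illustration of that mapping, here is a minimal sketch that creates an AvroSerDe-backed Hive table over Hive JDBC. The table name, HDFS paths, and connection URL are placeholders I've made up; the SerDe and container format class names are the ones the AvroSerDe wiki page documents.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class MapHiveToAvro {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // HiveServer2 URL is a placeholder; adjust host, port, and database
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default");
             Statement stmt = conn.createStatement()) {
            // No column list in the DDL: the columns come from the Avro schema file
            stmt.execute(
                "CREATE EXTERNAL TABLE IF NOT EXISTS customer "
                + "ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' "
                + "STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' "
                + "OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' "
                + "LOCATION '/data/customer' "
                + "TBLPROPERTIES ('avro.schema.url'='hdfs:///schema/tableschema/customer.avsc')");
        }
    }
}
```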
I am exploring various JSON APIs that can be used for this, and also exploring other approaches.
http://www.infoq.com/articles/AVROSchemaJAXB
Nokia has released code to generate Avro schemas from XML schemas:
https://github.com/Nokia/Avro-Schema-Generator
Okay, my problem statement and solution are simple.
The ideas in my mind are (a minimal sketch of the pipeline follows this list):
- Store the schema details of each table in a database
- Read those field details from the database and generate an Avro schema
- Store it at a known location in HDFS, e.g. /schema/tableschema
- Point the Hive table at this Avro schema location in HDFS
- When a schema change comes in, update the database and have the system regenerate the Avro schema
- Push the new schema to HDFS
- Hive picks up the new schema without breaking old data, so the system supports schema changes and evolution for data in Hadoop
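Here is a minimal sketch of the generate-and-push steps, assuming Avro's SchemaBuilder and the Hadoop FileSystem API; the record name, field list, and namespace are illustrative placeholders standing in for whatever the schema database holds.

```java
import java.nio.charset.StandardCharsets;

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SchemaPublisher {
    public static void main(String[] args) throws Exception {
        // In the real pipeline, these field details would be read from the schema database
        Schema schema = SchemaBuilder.record("customer").namespace("com.example")
                .fields()
                .requiredLong("id")
                .requiredString("name")
                .optionalString("email")   // nullable with default null, so old data stays readable
                .endRecord();

        // Push the generated .avsc to the agreed location in HDFS
        FileSystem fs = FileSystem.get(new Configuration());
        Path target = new Path("/schema/tableschema/customer.avsc");
        try (FSDataOutputStream out = fs.create(target, true)) {
            out.write(schema.toString(true).getBytes(StandardCharsets.UTF_8));
        }
    }
}
```

Because Hive resolves avro.schema.url at query time, overwriting this one file is enough to push a schema change; old data files written with the previous schema remain readable as long as new fields carry defaults.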
Most NoSQL databases take a similar approach; see the Oracle links below:
http://docs.oracle.com/cd/NOSQL/html/GettingStartedGuide/avroschemas.html
The Oracle NoSQL solution manages schema information and schema changes in the KVStore:
http://docs.oracle.com/cd/NOSQL/html/GettingStartedGuide/provideschema.html
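Underneath all of these approaches, Avro's own schema-resolution rules do the heavy lifting: a new field with a default can be added anywhere in the record, and data written with the old schema is still readable. A small self-contained sketch (the record and field names are made up):

```java
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class EvolutionDemo {
    public static void main(String[] args) throws Exception {
        Schema v1 = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"customer\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"}]}");
        // v2 adds a new field; the default is what makes old data readable
        Schema v2 = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"customer\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"},"
            + "{\"name\":\"email\",\"type\":\"string\",\"default\":\"unknown\"}]}");

        // Write a record with the old (v1) schema
        GenericRecord rec = new GenericData.Record(v1);
        rec.put("id", 42L);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(bytes, null);
        new GenericDatumWriter<GenericRecord>(v1).write(rec, encoder);
        encoder.flush();

        // Read it back with v1 as the writer schema and v2 as the reader schema
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes.toByteArray(), null);
        GenericRecord evolved = new GenericDatumReader<GenericRecord>(v1, v2).read(null, decoder);
        System.out.println(evolved);  // {"id": 42, "email": "unknown"}
    }
}
```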
Use ORC
https://github.com/hortonworks/orc
The Hortonworks folks are working on a new file format which, like Avro, stores the schema within the data:
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds_Hive/orcfile.html
The "versioned metadata" means that the ORC file's metadata is stored in ProtoBufs so that we can add (or remove) fields to the metadata. That means that for some changes to ORC file format we can provide both forward and backward compatibility.
ORC files, like Avro files, are self-describing: they include the type structure of the records in the metadata of the file. It will take more integration work with Hive to make schemas very flexible with ORC.
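As a quick sketch of that self-describing property, assuming the standalone ORC Java library (org.apache.orc, which grew out of the repo linked above), this reads the type structure straight out of a file's metadata:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.TypeDescription;

public class OrcSchemaDump {
    public static void main(String[] args) throws Exception {
        // Path to an existing ORC file is taken from the command line
        Reader reader = OrcFile.createReader(new Path(args[0]),
                OrcFile.readerOptions(new Configuration()));
        TypeDescription schema = reader.getSchema();  // type structure lives in the file footer
        System.out.println(schema);  // e.g. struct<id:bigint,name:string>
    }
}
```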