Different File Formats (Avro vs Parquet vs JSON vs XML vs Protobuf vs ORC)

Baisali Pradhan
2 min read · Jun 19, 2021


These file formats fall into two types: binary (Avro, Parquet, ORC, Protobuf) and non-binary, or text-based (JSON, XML).

Binary formats are machine-readable, whereas non-binary formats are human-readable.
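To make the difference concrete, here is a minimal sketch that encodes the same record once as JSON text and once as raw bytes. The record and the `"<id"` byte layout are assumptions made up for this illustration; they are not the layout of Avro, Parquet, or any other real format.

```python
import json
import struct

record = {"id": 42, "price": 19.99}

# Text (non-binary) encoding: a person can read it in any editor.
text_bytes = json.dumps(record).encode("utf-8")

# Binary encoding: a 4-byte little-endian int followed by an 8-byte double.
# The "<id" layout is an assumption for this sketch, not a real file format.
binary_bytes = struct.pack("<id", record["id"], record["price"])

print(text_bytes)                           # readable text
print(binary_bytes.hex())                   # opaque bytes, shown as hex
print(len(text_bytes), len(binary_bytes))   # sizes of the two encodings

# Decoding the binary form only works if the layout is known in advance.
decoded_id, decoded_price = struct.unpack("<id", binary_bytes)
```

The binary form is smaller, but it means nothing without the layout (in real formats, the schema) that describes it.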

Binary formats scale well and are preferred for distributed systems, whereas non-binary formats have limited use in Big Data/Hadoop systems because of their limits in scalability and parallelism.

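Binary formats such as Avro, Parquet, and ORC can be split into blocks that are spread across multiple disks or servers and processed in parallel. A single JSON or XML document generally can't be split at an arbitrary point, because a fragment is not valid on its own; line-delimited JSON (one record per line) is the usual exception.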
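A rough sketch of what splitting means in practice: half of a JSON document cannot be parsed by itself, while a file built from independent blocks can be handed out block by block. The block layout below (`RECORD`, `BLOCK_ROWS`, `read_block`) is hypothetical and much simpler than what real formats do.

```python
import json
import struct

rows = [{"id": i, "score": i * 1.5} for i in range(6)]

# One JSON document: cut at an arbitrary offset, the fragment is not parseable.
doc = json.dumps(rows)
first_half = doc[: len(doc) // 2]
try:
    json.loads(first_half)
    half_is_valid = True
except json.JSONDecodeError:
    half_is_valid = False

# Hypothetical binary layout: independent blocks of fixed-size records.
RECORD = struct.Struct("<id")  # 4-byte int + 8-byte double per record
BLOCK_ROWS = 2                 # records per block (assumed for the sketch)
blocks = [
    b"".join(RECORD.pack(r["id"], r["score"]) for r in rows[i : i + BLOCK_ROWS])
    for i in range(0, len(rows), BLOCK_ROWS)
]

def read_block(block: bytes):
    """Decode one block without looking at any other block."""
    return [RECORD.unpack_from(block, off) for off in range(0, len(block), RECORD.size)]

# Each block could be processed by a different worker or server.
per_worker = [read_block(b) for b in blocks]
print(half_is_valid, per_worker)
```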

Binary formats are used when data or messages need to be exchanged between two or more services, whereas non-binary formats are used when data or messages need to be exchanged with browsers or tools.

Parquet, ORC :

Store data in a column-oriented layout.
Good for read-heavy analytical applications.
Parquet is widely used in Spark applications, whereas ORC is heavily used in Hive.
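The points above can be sketched in plain Python. This only illustrates the idea of a column-oriented layout; it does not use Parquet or ORC, and the sample data is invented.

```python
rows = [
    {"user": "a", "country": "IN", "amount": 10.0},
    {"user": "b", "country": "US", "amount": 25.5},
    {"user": "c", "country": "IN", "amount": 4.5},
]

# Column-oriented layout: one contiguous list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An analytical query such as SUM(amount) touches a single column...
total_columnar = sum(columns["amount"])

# ...whereas the row layout has to walk every full record to get there.
total_row = sum(row["amount"] for row in rows)

print(columns["amount"], total_columnar, total_row)
```

Both layouts give the same answer; the columnar one simply avoids reading the `user` and `country` values, which is what makes it attractive for read-heavy analytics.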

Avro & Protobuf :

Store data in rows.
Good for write-heavy applications such as transaction systems.
Commonly used for Kafka messages.
Well suited to schema evolution.
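Schema evolution is easier to see with an example. The sketch below imitates the general idea (a reader fills in defaults for fields it expects but does not find, and ignores fields it does not know) in plain Python. `READER_SCHEMA_V2` and `resolve` are hypothetical names for this illustration, not part of the Avro or Protobuf APIs.

```python
# Hypothetical reader schema: field name -> default used when the writer
# did not know about that field.
READER_SCHEMA_V2 = {"id": None, "name": None, "email": "unknown"}

def resolve(record: dict, reader_schema: dict) -> dict:
    """Project a decoded record onto the reader's schema.

    Fields missing from the record get the reader's default; fields the
    reader does not know about are dropped.
    """
    return {field: record.get(field, default) for field, default in reader_schema.items()}

# Written by an older producer that had no "email" field.
old_record = {"id": 1, "name": "Asha"}
# Written by a newer producer that added a "phone" field.
newer_record = {"id": 2, "name": "Ravi", "email": "r@example.com", "phone": "555"}

print(resolve(old_record, READER_SCHEMA_V2))
print(resolve(newer_record, READER_SCHEMA_V2))
```

Because old and new records both resolve cleanly, producers and consumers can be upgraded at different times.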

JSON :

It is widely used in browser-based applications.
JSON is generally quicker to read and write than XML.
Its syntax is derived from JavaScript object literals.
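As a small illustration, Python's standard `json` module converts between a dictionary and JSON text in one call each way. The `order` data is made up for the example.

```python
import json

order = {"order_id": 1001, "items": ["pen", "notebook"], "paid": True}

payload = json.dumps(order)      # Python dict -> JSON text
restored = json.loads(payload)   # JSON text -> Python dict

print(payload)
print(restored == order)
```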

XML :

XML data is stored as text.
An XML file is larger: representing the same data in XML usually produces a larger file than JSON.
XML data is represented with tags, i.e., a start tag and an end tag.
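A quick way to check the size claim is to serialize the same small record both ways with Python's standard library. The `person` record is invented for the example, and the exact gap will vary with the data and the tag names chosen.

```python
import json
import xml.etree.ElementTree as ET

person = {"name": "Asha", "city": "Pune", "age": "30"}

# JSON: each key appears once.
json_text = json.dumps(person)

# XML: each field name appears twice, in a start tag and an end tag.
root = ET.Element("person")
for key, value in person.items():
    ET.SubElement(root, key).text = value
xml_text = ET.tostring(root, encoding="unicode")

print(json_text)
print(xml_text)
print(len(json_text), len(xml_text))
```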
