The Open Source Community Tooling Built on Avro

Continuing my march through the event-driven and message-driven world of API specifications, I am working my way through the open source tooling that is built on the Avro specification. I am looking to better understand how the data serialization system is being put to work, and its relationship with the other layers of the API specification conversation. Here is the top tooling I'm tracking when it comes to Avro, organized by group.

Specification

  • avro - (forks: 1066) (stars: 1594) (watchers: 1594) - apache avro is a data serialization system.

Registries

  • schema registry - (forks: 736) (stars: 1234) (watchers: 1234) - confluent schema registry for kafka
  • schema registry ui - (forks: 88) (stars: 321) (watchers: 321) - web tool for avro schema registry
  • schemer - (forks: 3) (stars: 90) (watchers: 90) - schema registry for csv, tsv, json, avro and parquet schema. supports schema inference and graphql api.

Queries

  • rq - (forks: 45) (stars: 1553) (watchers: 1553) - record query - a tool for doing record analysis and transformation

Education

  • examples - (forks: 458) (stars: 670) (watchers: 670) - apache kafka and confluent platform examples and demos
  • kafka storm starter - (forks: 335) (stars: 726) (watchers: 726) - code examples that show how to integrate apache kafka 0.8+ with apache storm 0.9+ and apache spark streaming 1.1+, while using apache avro as the data serialization format.
  • avro hadoop starter - (forks: 86) (stars: 111) (watchers: 111) - example mapreduce jobs in java, hive, pig, and hadoop streaming that work on avro data.
  • Avro2TF - (forks: 19) (stars: 118) (watchers: 118) - avro2tf is designed to fill the gap of making users' training data ready to be consumed by deep learning training frameworks.

Serialization

  • avsc - (forks: 98) (stars: 844) (watchers: 844) - avro for javascript :zap:
  • avro4s - (forks: 178) (stars: 536) (watchers: 536) - avro schema generation and serialization / deserialization for scala
  • fastavro - (forks: 115) (stars: 362) (watchers: 362) - fast avro for python
  • gogen avro - (forks: 66) (stars: 191) (watchers: 191) - generate go code to serialize and deserialize avro schemas
  • avrohugger - (forks: 82) (stars: 147) (watchers: 147) - generate scala case class definitions from avro schemas
  • scalavro - (forks: 31) (stars: 119) (watchers: 119) - a reflection-based avro library in scala.
  • abracad - (forks: 31) (stars: 107) (watchers: 107) - a clojure library for de/serializing clojure data structures with avro.
  • python avro json serializ - (forks: 32) (stars: 104) (watchers: 104) - serializes data into a json format using avro schema.
  • avro_turf - (forks: 44) (stars: 97) (watchers: 97) - a library that makes it easier to use the avro serialization format from ruby.
  • avro rs - (forks: 48) (stars: 89) (watchers: 89) - avro client library implementation in rust
  • json schema avro - (forks: 22) (stars: 102) (watchers: 102) - avro to json schema, and back
  • jsAvroPhonetic - (forks: 56) (stars: 84) (watchers: 84) - a javascript implementation of avro phonetic
  • kafka avro - (forks: 34) (stars: 76) (watchers: 76) - node.js bindings for librdkafka with avro schema serialization.
  • pyavroc - (forks: 17) (stars: 46) (watchers: 46) - an avro file reader/writer for python
  • BlueSteel - (forks: 15) (stars: 47) (watchers: 47) - an avro encoding/decoding library for swift.
  • libserdes - (forks: 35) (stars: 36) (watchers: 36) - avro serialization/deserialization c/c++ library with confluent schema-registry support
  • vulcan - (forks: 8) (stars: 46) (watchers: 46) - functional avro for scala
  • avro schema - (forks: 2) (stars: 48) (watchers: 48) - apache avro schema tools for tarantool
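
To get a feel for what these serialization libraries are doing under the hood, here is a minimal sketch in plain Python of how an Avro record gets turned into bytes. It only handles the `long` and `string` types, and the schema, the `Test` record name, and the `encode_*` helper functions are my own illustration rather than the API of any library listed above. It assumes the binary encoding described in the Avro specification: longs are zig-zag encoded and written as variable-length integers, strings are written as a length followed by UTF-8 bytes, and a record is just its fields concatenated in schema order.

```python
import json

# Hypothetical schema for illustration; Avro schemas are defined in JSON.
SCHEMA = json.loads("""
{
  "type": "record",
  "name": "Test",
  "fields": [
    {"name": "a", "type": "long"},
    {"name": "b", "type": "string"}
  ]
}
""")


def encode_long(n: int) -> bytes:
    """Zig-zag encode a 64-bit signed integer, then write it as a varint."""
    zz = (n << 1) ^ (n >> 63)  # maps 0, -1, 1, -2, ... to 0, 1, 2, 3, ...
    out = bytearray()
    while zz & ~0x7F:
        out.append((zz & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        zz >>= 7
    out.append(zz)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """A string is its UTF-8 byte length (as a long) followed by the bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_record(schema: dict, record: dict) -> bytes:
    """Concatenate the encoded fields in the order the schema declares them."""
    encoders = {"long": encode_long, "string": encode_string}
    out = bytearray()
    for field in schema["fields"]:
        field_type = field["type"]
        if field_type not in encoders:
            raise ValueError(f"unsupported type in this sketch: {field_type}")
        out += encoders[field_type](record[field["name"]])
    return bytes(out)


if __name__ == "__main__":
    print(encode_record(SCHEMA, {"a": 27, "b": "foo"}).hex())
```

Notice that no field names or type markers end up in the output, which is why the reader needs the same schema as the writer, and why schema registries show up so prominently in this list.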

Generators

  • xml avro - (forks: 56) (stars: 58) (watchers: 58) - generate avro schema and avro binary from xsd schema and xml

Connectors

  • spark avro - (forks: 316) (stars: 535) (watchers: 535) - avro data source for apache spark
  • cpp serializers - (forks: 82) (stars: 484) (watchers: 484) - benchmark comparing various data serialization libraries (thrift, protobuf etc.) for c++

Code Generation

  • gradle avro plugin - (forks: 53) (stars: 135) (watchers: 135) - a gradle plugin to allow easily performing java code generation for apache avro. it supports json schema declaration files, json protocol declaration files, and avro idl files.
  • sbt avrohugger - (forks: 37) (stars: 95) (watchers: 95) - sbt plugin for generating scala sources for apache avro schemas and protocols.
  • avromatic - (forks: 11) (stars: 56) (watchers: 56) - generate ruby models from avro schemas

Tabular

  • iceberg - (forks: 48) (stars: 363) (watchers: 363) - iceberg is a table format for large, slow-moving tabular data

Toolchains

  • DevOps Python tools - (forks: 152) (stars: 310) (watchers: 310) - 80+ devops & data cli tools - aws, log anonymizer, spark, hadoop, hbase, hive, impala, linux, docker, spark data converters & validators (avro/parquet/json/csv/ini/xml/yaml), travis ci, ambari, blueprints, cloudformation, elasticsearch, solr, pig, ipython - python / jython tools
  • bigdata playground - (forks: 54) (stars: 157) (watchers: 157) - a complete example of a big data application using : kubernetes (kops/aws), apache spark sql/streaming/mlib, apache flink, scala, python, apache kafka, apache hbase, apache parquet, apache avro, apache storm, twitter api, mongodb, nodejs, angular, graphql

Data Store

  • chana - (forks: 50) (stars: 332) (watchers: 332) - avro data store based on akka

Data Generation

  • ratatool - (forks: 45) (stars: 251) (watchers: 251) - a tool for data sampling, data generation, and data diffing

Conversion

  • json wikipedia - (forks: 41) (stars: 241) (watchers: 241) - json wikipedia, contains code to convert the wikipedia xml dump into a json/avro dump
  • json avro converter - (forks: 60) (stars: 158) (watchers: 158) - json to avro conversion tool designed to make migration to avro easier.

Database

  • storagetapper - (forks: 46) (stars: 205) (watchers: 205) - storagetapper is a scalable realtime mysql change data streaming, logical backup and logical replication service

Binary

  • jackson dataformats binar - (forks: 67) (stars: 187) (watchers: 187) - uber-project for standard jackson binary format backends: avro, cbor, protobuf, smile

IDE

  • vscode data preview - (forks: 20) (stars: 168) (watchers: 168) - data preview 🈸 extension for importing 📤 viewing 🔎 slicing 🔪 dicing 🎲 charting 📊 & exporting 📥 large json array/config, yaml, apache arrow, avro & excel data files

Documentation

  • avrodoc - (forks: 60) (stars: 121) (watchers: 121) - documentation tool for avro schemas

Validation

  • aptos - (forks: 16) (stars: 141) (watchers: 141) - :sunny: a tool for validating data using json schema and converting json schema documents into different data-interchange formats

Command Line Interface

  • schema registry - (forks: 24) (stars: 96) (watchers: 96) - a cli and go client for kafka schema registry

Semantics

  • schema_salad - (forks: 33) (stars: 40) (watchers: 40) - semantic annotations for linked avro data

Like JSON Schema, Avro is a very data-centric specification. I need to better understand how it is used by leading providers like Confluent for powering Kafka, but I also want to better understand its relationship to JSON Schema, and how it is used with AsyncAPI and OpenAPI. This dive provided me with a fresh look at how the API space is evolving, and also how data and our databases are still king when it comes to everything API.