Working with Avro

Posted on December 8, 2023 by Tylor Kobierski

I’ve been trying to build a basic event-driven application working with Apache Kafka, just to see how it works.

As a bit of a stretch goal, I’ve been working with Avro as a way of defining schemas. My hope is to use them as a way to define a mutual schema between two different services, and then have that be a basis to evolve the schema in a safe and fully backward compatible manner.

To do this, I’m using Confluent’s schema registry, but this is not an endorsement; it just happens to be the tool that I’m working with right now. You could likely roll your own or use a similar service if you wanted to.

What does an Avro schema look like? Well, like almost everything else these days, it’s a JSON object.

{
    "type": "record",
    "namespace": "org.example",
    "name": "RecordCreatedEvent",
    "doc": "This defines that a record has been created",
    "fields": [
        { "name": "recordId", "type": "long", "doc": "The record that was created." }
    ]
}

You can use it to define a record and the fields that go into it. If you use Java, you can then feed that schema file to a build plugin, which generates a class that can serialize and deserialize messages of that type.

The Avro plugin will generate classes that have a builder. One of the neat things about the format is that each message has a reference to the schema that wrote it, so you should be able to read the message just by grabbing the schema attached to it. With the right serdes (serializer/deserializer), it’s easy to restore the message, even on a topic that receives multiple kinds of events.
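With Confluent’s serializers, that reference is a schema ID rather than the schema itself: the serialized value starts with a magic byte of 0, then a 4-byte big-endian schema ID, and the Avro-encoded data follows. Here is a minimal sketch of pulling those pieces apart. The class name, method names, and sample bytes are made up for illustration; in a real consumer you would let Confluent’s KafkaAvroDeserializer do this for you.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Sketch: split a Confluent-framed message into its schema ID and Avro payload. */
final class ConfluentFraming {
    private static final byte MAGIC_BYTE = 0x0;
    private static final int HEADER_LENGTH = 5; // 1 magic byte + 4-byte schema ID

    /** Returns the schema registry ID stored in bytes 1 to 4 of the message. */
    static int schemaId(byte[] message) {
        if (message.length < HEADER_LENGTH) {
            throw new IllegalArgumentException("Message is shorter than the framing header");
        }
        ByteBuffer buffer = ByteBuffer.wrap(message); // ByteBuffer is big-endian by default
        if (buffer.get() != MAGIC_BYTE) {
            throw new IllegalArgumentException("Unexpected magic byte");
        }
        return buffer.getInt();
    }

    /** Returns the Avro-encoded bytes that follow the header. */
    static byte[] avroPayload(byte[] message) {
        schemaId(message); // reuse the header validation
        return Arrays.copyOfRange(message, HEADER_LENGTH, message.length);
    }

    public static void main(String[] args) {
        // Hypothetical message: magic byte, schema ID 42, then one payload byte.
        byte[] message = {0, 0, 0, 0, 42, 2};
        System.out.println("schema id: " + schemaId(message));
        System.out.println("payload bytes: " + avroPayload(message).length);
    }
}
```

The consumer looks the ID up in the registry to get the writer’s schema, which is why a single topic can carry several event types and still be decoded.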

A tricky thing that I did want to do is reuse objects across messages. For instance, I wanted to reuse an object containing a little bit of audit info that would be carried along with the message. There are conflicting reports on the internet about whether this is possible, but I can say that it is possible, and not even hard; it’s just not obvious at first glance. There are two ways to do it without repeating yourself. The second way is preferable.

The first way is to define a union type, in which you put all your records into a single file:

[
    { 
        "type": "record", 
        "namespace": "org.example",
        "name": "AuditInfo",
        "fields": [
            { "name": "username", "type": "string" },
            { "name": "created_date", "type": "string" }
        ]
    },
    {
        "type": "record",
        "namespace": "org.example",
        "name": "RecordCreatedEvent",
        "fields": [
            { "name": "recordId", "type": "long" },
            { "name": "created", "type": "org.example.AuditInfo" }
        ]
    },
    {
        "type": "record",
        "namespace": "org.example",
        "name": "RecordDeletedEvent",
        "fields": [
            { "name": "recordId", "type": "long" },
            { "name": "created", "type": "org.example.AuditInfo" }
        ]
    }
]

This will work in almost every case, though it comes with a weakness: if you use this union to define an event topic that can carry multiple events, someone could conceivably use it to send one of the subordinate records on its own.

The second way is to define each record in its own file. When you go to use them, you will have to explicitly reference the file in whatever you use to generate your classes. For instance, if you use the Apache Avro plugin for Maven, your subordinate schemas appear in your <import> lines:

<plugin>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro-maven-plugin</artifactId>
    <version>${avro.version}</version>
    ...
    <configuration>
        <imports>
            <import>avro/org/example/audit-info.avsc</import>
        </imports>
        <sourceDirectory>src/main/avro/</sourceDirectory>
        <outputDirectory>src/main/java/</outputDirectory>
    </configuration>
</plugin>

Now, your AuditInfo class can be referenced by any of your other classes without problem! I’ve seen a few other ways of specifying references to other schemas, depending on your library and language. You’ll likely have to check your library’s documentation, but the option should exist.

Another tricky thing that I ran into is that field ordering matters, unlike with, say, a JSON object. Avro’s binary encoding writes field values in schema order, without field names, so if you swap the order of two fields of the same type and the reader decodes with a schema other than the one that wrote the data, the values can come out transposed. Luckily, that’s not too hard to avoid.
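To make the transposition concrete, here is a minimal sketch that hand-rolls Avro’s zigzag variable-length encoding of long values instead of using the Avro library. The second field, parentId, and all class and method names are made up for illustration; the point is only that the bytes carry values in order and nothing else.

```java
import java.io.ByteArrayOutputStream;

/** Sketch: Avro-style positional encoding of two long fields, read back in the wrong order. */
final class FieldOrderDemo {

    /** Writes one long the way Avro encodes it: zigzag, then 7 bits per byte. */
    static void writeLong(ByteArrayOutputStream out, long value) {
        long n = (value << 1) ^ (value >> 63); // zigzag keeps small negatives short
        while ((n & ~0x7FL) != 0) {
            out.write((int) ((n & 0x7F) | 0x80)); // high bit set: more bytes follow
            n >>>= 7;
        }
        out.write((int) n);
    }

    /** Reads one long starting at pos[0] and advances pos[0] past it. */
    static long readLong(byte[] data, int[] pos) {
        long n = 0;
        int shift = 0;
        int b;
        do {
            b = data[pos[0]++] & 0xFF;
            n |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return (n >>> 1) ^ -(n & 1); // undo zigzag
    }

    /** Writer schema order: recordId, then parentId (hypothetical field). */
    static byte[] encode(long recordId, long parentId) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeLong(out, recordId);
        writeLong(out, parentId);
        return out.toByteArray();
    }

    /** Reader that uses the writer's order. Returns {recordId, parentId}. */
    static long[] decodeWithWriterOrder(byte[] data) {
        int[] pos = {0};
        long recordId = readLong(data, pos);
        long parentId = readLong(data, pos);
        return new long[] {recordId, parentId};
    }

    /** Reader that assumes parentId comes first. Returns {recordId, parentId}. */
    static long[] decodeWithSwappedOrder(byte[] data) {
        int[] pos = {0};
        long parentId = readLong(data, pos); // actually the writer's recordId
        long recordId = readLong(data, pos); // actually the writer's parentId
        return new long[] {recordId, parentId};
    }

    public static void main(String[] args) {
        byte[] bytes = encode(7L, 99L);
        long[] right = decodeWithWriterOrder(bytes);
        long[] wrong = decodeWithSwappedOrder(bytes);
        System.out.println("writer order:  recordId=" + right[0] + " parentId=" + right[1]);
        System.out.println("swapped order: recordId=" + wrong[0] + " parentId=" + wrong[1]);
    }
}
```

Nothing fails in the swapped case; the reader simply ends up with the two values exchanged. When the reader is handed the writer’s schema as well as its own, Avro’s schema resolution matches fields by name, so a reorder is harmless in that setup.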

Overall, despite the speedbumps, I’ve enjoyed my time working with Avro and the schema registry. Having worked on projects in which applications did not document their integrations with each other, having a clear file defining the integration was a breath of fresh air, and the idea of both applications referring to that integration using the same source of truth is appealing. It’ll be cool to see how this ends up helping the application evolve in the future!