Introduction to protobuf
Problem Intro
Typically, JSON is used to send data across systems. If you are in Python land, you probably keep things in a dictionary and serialize/deserialize it with `json.dumps` and `json.loads` respectively. If you are using a web framework such as Flask, you would use `jsonify` and the `get_data()` method.
For example,
```python
import json

output = {'account_id': 1234, 'sales_amount': 40.0}
json_output = json.dumps(output)
type(json_output)   # str
print(json_output)  # {"account_id": 1234, "sales_amount": 40.0}

##### Using Flask ####
from flask import Flask
from flask import jsonify

app = Flask(__name__)
with app.app_context():
    json_output = jsonify(output)
    print(json_output)             # <Response 40 bytes [200 OK]>
    print(json_output.get_data())  # b'{"account_id":1234,"sales_amount":40.0}\n'
    input = json_output.get_data()
    print(json.loads(input))       # {'account_id': 1234, 'sales_amount': 40.0}
```
Problem 1
Using JSON poses quite a few challenges: the sending/receiving parties do not have to enforce data types, which makes the data hard to parse further downstream. For instance (see the sketch below),

- `account_id` can arrive as the string `"1234"`
- `sales_amount` can arrive as the int `40`

(sounds familiar?)
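To make this concrete, here is a minimal sketch (not part of the original example) showing that nothing in the JSON layer catches the wrong types; the failure only surfaces downstream:

```python
import json

# wrong types, yet json serializes and parses without complaint
payload = json.dumps({'account_id': '1234', 'sales_amount': 40})
data = json.loads(payload)

print(type(data['account_id']))    # <class 'str'>
print(type(data['sales_amount']))  # <class 'int'>
data['sales_amount'] / 2           # works today...
# data['account_id'] + 1           # ...but this downstream line raises TypeError
```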
Problem 2
Among us data folks, we usually reference the dictionary by key, e.g. `data.get('sales_amount')`. What if additional fields are required, or the schema naming changes from `sales_amount` to `total_sales`? Then downstream code referencing `data.get('sales_amount')` would fail.
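A quick illustration of the renaming problem, assuming a hypothetical upstream rename from `sales_amount` to `total_sales`:

```python
# Upstream has renamed the key, but downstream still asks for the old one.
old_event = {'account_id': 1234, 'sales_amount': 40.0}
new_event = {'account_id': 1234, 'total_sales': 40.0}

print(old_event.get('sales_amount'))  # 40.0
print(new_event.get('sales_amount'))  # None -- the rename fails silently
# new_event['sales_amount']           # and direct indexing raises KeyError
```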
In production ML systems making predictions, a change in the schema naming convention can silently yield different results. Release cycles get more complicated, and systems become more tightly coupled.
Problem 3
Furthermore, in a typical data lake (which is very expensive to build and maintain), everything is usually dumped as JSON blobs with no consistency, resulting in significant technical debt, and data cleaning will definitely be painful. It is no wonder data scientists spend more than 80% of their time cleaning data.
Ouch!
Introducing protobuf
Let's come up with a mock problem: we are trying to send/parse the following data:
| fields | data type |
|---|---|
| account_id | int |
| sales_amount | float |
| method | enum |
For those unfamiliar, an `enum` is like a factor variable: in this case, `method` can only take the values `CASH`, `CREDIT_CARD`, or `WALLET`.
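As a preview, this is how that enum is declared in the `events.proto` file used throughout this post (the full file appears in the hands-on section):

```proto
enum payment_method {
  CASH = 0;
  CREDIT_CARD = 1;
  WALLET = 2;
}
```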
You can read the Google guide on protobufs1. Essentially, protobuf generates a Python module from a `.proto` file. In this example, we define `events.proto`, from which `protoc` generates `events_pb2.py`.
Note: if you are interested in following along, you may go through the hands-on section. Otherwise, feel free to skip to the next section.
Hands on - setup (Optional)
Installation & Setup
- “Files”

  After finishing the steps in the tabs, this is how your directory should look:

  ```
  .
  ├── events.proto
  ├── events_pb2.py
  └── script.py (or ipynb)
  ```

  You can also choose to use a Jupyter notebook instead of a Python script.
  As of this writing, the latest protobuf version is 3.12.3. We are now good to go!
- “1. Prep Py Env”

  For this guide, we will be using Anaconda. Prepare the Python environment:

  ```bash
  conda create -n stream python=3.7
  conda activate stream
  conda install protobuf==3.12.3 flask==1.1.2
  ```

  If you are not using Anaconda and prefer native Python instead:

  ```bash
  sudo pip install google protobuf
  pip install flask
  ```
- “2. Install Protobuf”

  For installation guides, please refer to the Google documentation2, or if you are in a hurry:

  Using a Mac:

  ```bash
  brew install protobuf
  ```

  If you prefer not to use brew:

  ```bash
  PROTOC_ZIP=protoc-3.12.3-osx-x86_64.zip
  curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v3.12.3/$PROTOC_ZIP
  sudo unzip -o $PROTOC_ZIP -d /usr/local bin/protoc
  sudo unzip -o $PROTOC_ZIP -d /usr/local 'include/*'
  rm -f $PROTOC_ZIP
  ```

  Using Linux:

  ```bash
  PROTOC_ZIP=protoc-3.12.3-linux-x86_64.zip
  curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v3.12.3/$PROTOC_ZIP
  sudo unzip -o $PROTOC_ZIP -d /usr/local bin/protoc
  sudo unzip -o $PROTOC_ZIP -d /usr/local 'include/*'
  rm -f $PROTOC_ZIP
  ```
- “3. Defining events.proto”

  ```proto
  syntax = "proto2";
  package tutorial;

  message PaymentInfo {
    required int32 account_id = 1;
    optional float sales_amount = 2;
    enum payment_method {
      CASH = 0;
      CREDIT_CARD = 1;
      WALLET = 2;
    }
    optional payment_method method = 3;
  }
  ```
- “4. Protoc to events_pb2.py”

  In your current directory:

  ```bash
  INPUT_FILE=events.proto
  SRC_DIR="$(pwd)"
  DST_DIR=$SRC_DIR
  protoc -I=$SRC_DIR --python_out=$DST_DIR $SRC_DIR/$INPUT_FILE
  ```

  You should see an `events_pb2.py` output.
Using Protobuf
```python
import events_pb2  # generated by the steps above

payment = events_pb2.PaymentInfo()
payment.account_id = 1234
payment.sales_amount = 142.0
payment.method = 1  # CREDIT_CARD
print(payment)
"""
account_id: 1234
sales_amount: 142.0
method: CREDIT_CARD
"""

message = payment.SerializeToString()
print(message)  # b'\x08\xd2\t\x15\x00\x00\x0eC\x18\x01'
type(message)   # <class 'bytes'>

# Downstream user receiving the message:
payment_proto = events_pb2.PaymentInfo()
payment_proto.ParseFromString(message)
payment_proto  # same output as above

from google.protobuf.json_format import MessageToJson
json_msg = MessageToJson(payment_proto)
print(json_msg)
"""
{
  "accountId": 1234,
  "salesAmount": 142.0,
  "method": "CREDIT_CARD"
}
"""
```
Your message is serialized to `b'\x08\xd2\t\x15\x00\x00\x0eC\x18\x01'`!
Python Object

- You can access the fields as attributes of a Python object. For instance:

  ```python
  payment_proto.account_id  # 1234
  payment_proto.method      # 1
  ```
Data Size

- The serialized protobuf message is also smaller than its JSON equivalent:

  ```python
  # compare sizes
  import sys
  sys.getsizeof(message)   # 43
  sys.getsizeof(json_msg)  # 123
  ```
Wrong data type

- What if you assign the wrong data type?

  ```python
  payment = events_pb2.PaymentInfo()
  payment.account_id = "123"
  ```

  ```
  Traceback (most recent call last):
    File "<input>", line 1, in <module>
  TypeError: '123' has type str, but expected one of: int, long
  ```

  Or miss out a required field?

  ```python
  payment = events_pb2.PaymentInfo()
  payment.sales_amount = 142.0
  payment.SerializeToString()
  ```

  ```
  Traceback (most recent call last):
    File "<input>", line 3, in <module>
  google.protobuf.message.EncodeError: Message tutorial.PaymentInfo is missing required fields: account_id
  ```
Changing Schema
Now, as your business grows, you start introducing a few more features, and would like to change your data model.
| fields | data type |
|---|---|
| account_id | int |
| total_sales | float |
| sales_currency | enum |
| method | enum |
| cart_id | int |
Notice that `sales_amount` has now been changed to `total_sales`. You will then define the new schema and generate the Python module again. For the purpose of a later demonstration, we will keep the old Python module and rename it to `events_old_pb2.py`.
Similarly, you may choose to follow the hands-on section or skip to the next step.
Hands on - Changing Schema (Optional)
Rename your old generated module from `events_pb2.py` to `events_old_pb2.py`:

```bash
mv events_pb2.py events_old_pb2.py
```

and the `events.proto` should now look like:
```proto
syntax = "proto2";
package tutorial;

message PaymentInfo {
  required int32 account_id = 1;
  optional float total_sales = 2;
  enum currency {
    SGD = 0;
    USD = 1;
  }
  optional currency sales_currency = 4;
  enum payment_method {
    CASH = 0;
    CREDIT_CARD = 1;
    WALLET = 2;
  }
  optional payment_method method = 3;
  optional int32 cart_id = 5;
}
```
Notice the change from `sales_amount` to `total_sales`; importantly, the field number (2) stays the same, which is what makes the compatibility shown below possible.
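After updating `events.proto`, regenerate the Python module; presumably the same `protoc` invocation from the setup step works unchanged:

```bash
# re-run protoc so events_pb2.py reflects the new schema
INPUT_FILE=events.proto
SRC_DIR="$(pwd)"
DST_DIR=$SRC_DIR
protoc -I=$SRC_DIR --python_out=$DST_DIR $SRC_DIR/$INPUT_FILE
```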
Backward compatibility
Let's use the same old message from the earlier example: `b'\x08\xd2\t\x15\x00\x00\x0eC\x18\x01'`.
Note: whenever you regenerate a protobuf module, restart the Python kernel (or start a new one) so the new definition is picked up.
```python
import events_pb2

new_parser = events_pb2.PaymentInfo()
old_message = b'\x08\xd2\t\x15\x00\x00\x0eC\x18\x01'
new_parser.ParseFromString(old_message)
print(new_parser)
"""
account_id: 1234
total_sales: 142.0
method: CREDIT_CARD
"""
```
The old `sales_amount` field is automatically read as `total_sales`!
Forward compatibility
Now, let's define a new event with the new schema (along with some new tricks).
```python
import events_pb2
from google.protobuf.json_format import MessageToJson

new_parser = events_pb2.PaymentInfo()
new_parser.account_id = 123
new_parser.total_sales = 1000.0
new_parser.sales_currency = new_parser.currency.Value('USD')
new_parser.method = new_parser.payment_method.Value('CASH')
new_parser.cart_id = 1
print(new_parser)
"""
account_id: 123
total_sales: 1000.0
method: CASH
sales_currency: USD
cart_id: 1
"""

new_message = new_parser.SerializeToString()
print(new_message)  # b'\x08{\x15\x00\x00zD\x18\x00 \x01(\x01'

# Extra info
import sys
sys.getsizeof(new_message)                # 46
sys.getsizeof(MessageToJson(new_parser))  # 156
```
Restart your Python kernel, and use the old protobuf module:
```python
import events_old_pb2

old_parser = events_old_pb2.PaymentInfo()
new_message = b'\x08{\x15\x00\x00zD\x18\x00 \x01(\x01'
old_parser.ParseFromString(new_message)
print(old_parser)
"""
account_id: 123
sales_amount: 1000.0
method: CASH
"""
```
This way, upstream and downstream teams can use a centralized protobuf file to communicate, and systems are decoupled. Upstream services can make changes to the schema without fear of breaking downstream systems. Correspondingly, downstream services can pick up the new protobuf file in their next release cycle.
Wonderful, isn’t it?
This does not come for free; you need to obey some guidelines3.
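For instance, one common guideline is to never change or reuse a field number once it is in use. As a sketch (not part of this tutorial's schema), a retired field can be reserved so it cannot be reused by accident:

```proto
// Sketch only: if cart_id were ever removed from PaymentInfo,
// reserving its number and name prevents accidental reuse later.
message PaymentInfoSketch {
  required int32 account_id = 1;
  optional float total_sales = 2;
  reserved 5;
  reserved "cart_id";
}
```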
Adding more message types
There are many other ways to use protobuf; please refer to the reference4. Below is an example of adding a new message type that uses repeated fields.
```proto
syntax = "proto2";
package tutorial;

message PaymentInfo {
  required int32 account_id = 1;
  optional float total_sales = 2;
  enum currency {
    SGD = 0;
    USD = 1;
  }
  optional currency sales_currency = 4;
  enum payment_method {
    CASH = 0;
    CREDIT_CARD = 1;
    WALLET = 2;
  }
  optional payment_method method = 3;
  optional int32 cart_id = 5;
}

message Cart {
  required int32 cart_id = 1;
  message Item {
    optional string item = 1;
    optional int32 quantity = 2;
    optional float amount = 3;
  }
  repeated Item Items = 2;
}
```
```python
import events_pb2
from google.protobuf.json_format import MessageToJson

cart_event = events_pb2.Cart()
cart_event.cart_id = 5678

cart_items = cart_event.Items.add()
cart_items.amount = 40.0
cart_items.quantity = 3
cart_items.item = 'chicken'

cart_items = cart_event.Items.add()
cart_items.amount = 60.0
cart_items.quantity = 2
cart_items.item = 'beef'

data = MessageToJson(cart_event)
import json
json.loads(data)
```
JSON output:
```json
{
  "cartId": 5678,
  "Items": [
    {
      "item": "chicken",
      "quantity": 3,
      "amount": 40.0
    },
    {
      "item": "beef",
      "quantity": 2,
      "amount": 60.0
    }
  ]
}
```
Neat, right?
Conclusion
Hopefully you have learnt a bit about protobuf. You will find many articles and online posts advocating for protobuf and describing its benefits. As always, there are two sides to every coin; I leave you with the other side5 if you are interested.