Improve Apache Beam Performance with Grouping Records
When we need to call an external API, we should batch multiple records into each call.
3 min read · Aug 27, 2022
Originally Published @ https://asyncq.com/
Introduction
- Apache Beam stores the data it reads in an immutable PCollection. If we read an input file as strings, the data is stored in a PCollection<String>.
- It's not uncommon to interact with an external API through a REST interface, where we send a bunch of records to a service and expect some output in return.
- If we read n records in parallel and call the external API for each of them in parallel, we can overwhelm the service and may hit its rate limit. This is an important case where we need to batch input records.
- In this article, we will see how to batch input records so that we send batches of records to the REST endpoint and respect the API rate limit.
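To make the batching idea concrete before looking at Beam itself, here is a minimal plain-Java sketch. It is not Beam API and not necessarily how the pipeline in this article does it: the class name `RecordBatcher`, the method `partition`, and the batch size of 3 are all assumptions made for illustration. It only shows how grouping records reduces the number of calls an external service would receive.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration of batching; hypothetical helper, not part of Apache Beam.
class RecordBatcher {

    // Splits records into consecutive batches of at most batchSize elements.
    static <T> List<List<T>> partition(List<T> records, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < records.size(); start += batchSize) {
            int end = Math.min(start + batchSize, records.size());
            // Copy the slice so each batch is independent of the source list.
            batches.add(new ArrayList<>(records.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> records = List.of("r1", "r2", "r3", "r4", "r5", "r6", "r7");
        List<List<String>> batches = partition(records, 3);
        // Each batch would become one request to the (hypothetical) external API.
        for (List<String> batch : batches) {
            System.out.println("one API call with " + batch.size() + " records: " + batch);
        }
    }
}
```

With seven records and a batch size of 3, the service would see three requests instead of seven. In a real pipeline the batch size would be chosen from the API's documented rate and payload limits.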
Use Case
- A very generic use case is to read sensitive data from an input file and send it to a DLP service to mask it before we write it to our data warehouse.
Solution
- If we go through the Apache Beam documentation, we will come across a function called…