<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Bemi Blog]]></title><description><![CDATA[The latest technical posts, announcements, and guides on automatic audit trail and data change tracking for PostgreSQL.]]></description><link>https://blog.bemi.io/</link><image><url>https://blog.bemi.io/favicon.png</url><title>Bemi Blog</title><link>https://blog.bemi.io/</link></image><generator>Ghost 5.75</generator><lastBuildDate>Mon, 13 Apr 2026 11:04:44 GMT</lastBuildDate><atom:link href="https://blog.bemi.io/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Announcing Our Pre-Seed Round]]></title><description><![CDATA[<p>We&#x2019;re excited to announce that Bemi has closed a pre-seed round led by Night Capital. Among the funds invested were Mucker Capital, Niche Capital, AngelList, Materialized View Capital, and several prominent angels. We&#x2019;re grateful to our investors and supporters for believing in our mission.</p><p>We&apos;</p>]]></description><link>https://blog.bemi.io/pre-seed/</link><guid isPermaLink="false">66b2f0fbd9dfee00016e3fe3</guid><category><![CDATA[Announcement]]></category><dc:creator><![CDATA[Arjun Lall]]></dc:creator><pubDate>Fri, 27 Sep 2024 16:55:00 GMT</pubDate><media:content url="https://blog.bemi.io/content/images/2024/09/blog-1.png" medium="image"/><content:encoded><![CDATA[<img src="https://blog.bemi.io/content/images/2024/09/blog-1.png" alt="Announcing Our Pre-Seed Round"><p>We&#x2019;re excited to announce that Bemi has closed a pre-seed round led by Night Capital. Among the funds invested were Mucker Capital, Niche Capital, AngelList, Materialized View Capital, and several prominent angels. 
We&#x2019;re grateful to our investors and supporters for believing in our mission.</p><p>We&apos;re on a mission to make Postgres data tracking infrastructure incredibly easy for developers. We have a team with deep technical knowledge in this space so that companies never have to go down the painful path of building and maintaining complex infrastructure with teams of data engineers, devops, database administrators, and software developers.</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">?</div><div class="kg-callout-text">Read more in our exclusive with <a href="https://betakit.com/universe-alums-aim-to-leverage-developer-infrastructure-experience-with-bemi/?ref=blog.bemi.io">BetaKit</a></div></div><p>We&#x2019;ll use the funds to build out our engineering team and support our growth. Bemi is critical infrastructure for our customers, so we&#x2019;re doubling down on our continued heavy investments in reliability, <a href="https://bemi.io/security?ref=blog.bemi.io">security</a>, and <a href="https://trust.bemi.io/?ref=blog.bemi.io">compliance</a> to ensure we continue to meet their needs. </p><p>Postgres is the world&apos;s <a href="https://survey.stackoverflow.co/2024/?ref=blog.bemi.io">most loved database by developers</a>, and like Postgres, we&#x2019;re committed to transparency, extensibility, open source, and zero hosting vendor lock-in. </p><blockquote>
<p>We love working with the Bemi team. Their customer service is incredible &#x2014; responsive, knowledgeable, and always willing to go the extra mile. They&#x2019;re a talented team, and we have full confidence in their expertise.<br>
&#x2014; &#xC1;lvaro Serrano, CTO, <a href="https://klog.co/en/?ref=blog.bemi.io">KLog</a></p>
</blockquote>
<p>We&#x2019;ve got exciting product updates on the roadmap and can&#x2019;t wait to share our progress with you. Stay tuned!</p>]]></content:encoded></item><item><title><![CDATA[When Postgres Indexing Went Wrong]]></title><description><![CDATA[It’s important to understand the basics of indexing and the best practices around them for preventing system downtime. ]]></description><link>https://blog.bemi.io/indexing/</link><guid isPermaLink="false">66edd4c1d9dfee00016e4191</guid><category><![CDATA[Engineering]]></category><dc:creator><![CDATA[Arjun Lall]]></dc:creator><pubDate>Mon, 23 Sep 2024 05:08:00 GMT</pubDate><media:content url="https://blog.bemi.io/content/images/2024/09/image-671--2-.png" medium="image"/><content:encoded><![CDATA[<img src="https://blog.bemi.io/content/images/2024/09/image-671--2-.png" alt="When Postgres Indexing Went Wrong"><p>Indexing in Postgres seems simple, but it&#x2019;s important to understand the basics of how it really works and the best practices for preventing system downtime.</p><p>TLDR: Be careful when creating indexes &#x2014; a lesson I learned the hard way when concurrent indexing failed silently.</p><h2 id="critical-incident">Critical incident</h2><p>At a previous company, we managed a high-volume Postgres instance with billions of rows of transactional data. As we scaled, query performance became a key priority, and one of the first optimizations was adding indexes. To avoid downtime, we used <code>CREATE INDEX CONCURRENTLY</code>, which allows indexing large tables without locking out writes for hours. Initially, p99 query performance improved dramatically.</p><p>A few weeks later, another team launched a new feature that was built to rely heavily on the new index. Everything seemed routine&#x2014;until the traffic spiked.</p><p>At first, the problem was subtle. A few queries took longer than expected. But within hours, the load began to spike. Query response times slowed to a crawl, and some requests were timing out. 
</p><p>We couldn&#x2019;t immediately see why. The index was in place, and a quick <code>EXPLAIN ANALYZE</code> confirmed it was being used. But users were still experiencing massive slowdowns, and we were on the brink of a full-scale production outage.</p><p>It wasn&#x2019;t until we checked the server logs that we pieced together what happened:</p><pre><code class="language-sql">CREATE INDEX CONCURRENTLY idx_email_2019 ON users_2019 (email);
ERROR: deadlock detected
DETAIL: Process 12345 waits for ShareLock on transaction 54321; blocked by process 54322.
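-- The failed build leaves the half-built index behind in an INVALID state.
-- One recovery option (per the Postgres docs) is to drop it and retry:
-- DROP INDEX idx_email_2019;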
</code></pre><h2 id="concurrent-indexing-can-fail-silently"><strong>Concurrent indexing can fail (silently)</strong></h2><p>Concurrent indexing needs more total work than a standard index build and takes much longer to complete. It uses a two-phase approach that helps avoid locking the table:</p><ul><li><strong>Phase 1:</strong> A snapshot of the current data gets taken, and the index is built on that.</li><li><strong>Phase 2:</strong> Postgres then catches up with any changes (inserts, updates, or deletes) that happened during phase 1.</li></ul><p>Because this build runs alongside normal traffic, the <code>CREATE INDEX CONCURRENTLY</code> command can fail partway through, leaving an incomplete index behind. An &#x201C;invalid&#x201D; index is ignored during querying, but this oversight can have serious consequences if not monitored.</p><pre><code>postgres=# \d users_emails_2019
       Table &quot;public.users_emails_2019&quot;
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
  ...   |            |           |          |
Indexes:
    &quot;idx&quot; btree (email) INVALID
</code></pre><p>In our case, the issue was amplified by the fact that our data was partitioned. The index had failed on some partitions but not others, leading to a situation where some queries were using the index while others were hitting unindexed partitions. This imbalance resulted in uneven query performance and significantly increased load on the system.</p><p>If we hadn&#x2019;t caught it when we did, we would have faced a full-blown production outage, impacting every user on the platform.</p><h2 id="best-practices-for-postgres-indexing"><strong>Best practices for Postgres indexing</strong></h2><p>To help others navigate this terrain, here are some best practices for Postgres indexing that can prevent these issues:</p><h3 id="avoid-dangerous-operations"><strong>Avoid dangerous operations</strong></h3><p>Always use the <code>CONCURRENTLY</code> flag when creating indexes in production. Without it, even smaller tables can block writes for unacceptably long, leading to system downtime. While <code>CONCURRENTLY</code> takes more CPU and I/O, the trade-off is worth it to maintain availability. Keep in mind that concurrent index builds can only happen one at a time on the same table, so plan accordingly for large datasets.</p><h3 id="monitor-concurrent-index-creation-closely"><strong>Monitor concurrent index creation closely</strong> </h3><p>Don&#x2019;t take successful index creation for granted. The system table <code>pg_stat_progress_create_index</code> can be queried for progress reporting while indexing is taking place.</p><pre><code class="language-sql">postgres=# SELECT * FROM pg_stat_progress_create_index;
-[ RECORD 1 ]------+---------------------------------------
pid                | 896799
datid              | 16402
datname            | postgres
relid              | 17261
index_relid        | 136565
command            | CREATE INDEX CONCURRENTLY
phase              | building index: loading tuples in tree
lockers_total      | 0
lockers_done       | 0
current_locker_pid | 0
blocks_total       | 0
blocks_done        | 0
tuples_total       | 10091384
tuples_done        | 1775295
partitions_total   | 0
partitions_done    | 0
</code></pre><h3 id="manually-validate-indexes"><strong>Manually validate indexes</strong></h3><p>If you don&#x2019;t check your indexes, you might think you&#x2019;re able to rely on them when you can&#x2019;t. And although an invalid index gets ignored during querying, it still adds overhead to every write. Common causes for index failures include:</p><ul><li>Deadlocks: Index creation might conflict with ongoing transactions, leading to deadlocks.</li><li>Disk Space: Large indexes may fail due to insufficient disk space.</li><li>Constraint Violations: Creating unique indexes on columns with non-unique data will result in failures.</li></ul><p>You can find all invalid indexes by running the following:</p><pre><code>SELECT * FROM pg_class, pg_index WHERE pg_index.indisvalid = false AND pg_index.indexrelid = pg_class.oid;
</code></pre><p>You can also query the <code>pg_stat_all_indexes</code> and <code>pg_statio_all_indexes</code> system views to verify that the index is being accessed.</p><h3 id="fix-invalid-indexes">Fix invalid indexes</h3><p>Invalid indexes can be recovered using the <code>REINDEX</code> command. It&#x2019;s the same as dropping and recreating the index, except it would also lock out reads that attempt to use that index (if not specifying <code>CONCURRENTLY</code>). Note that <code>REINDEX CONCURRENTLY</code> isn&#x2019;t supported in versions below Postgres 12.</p><pre><code class="language-sql">REINDEX INDEX CONCURRENTLY idx_users_email_2019;
</code></pre><p>If a problem occurs while rebuilding the indexes, it&#x2019;d leave behind a new invalid index suffixed with&#xA0;<code>_ccnew</code>. Drop it and retry&#xA0;<code>REINDEX CONCURRENTLY</code>.</p><pre><code class="language-sql">postgres=# \d users_2019
       Table &quot;public.users_2019&quot;
 Column |  Type   | Modifiers
--------+---------+-----------
 email  | text    |
Indexes:
    &quot;idx_users_email_2019&quot; btree (email) INVALID
    &quot;idx_users_email_2019_ccnew&quot; btree (email) INVALID
</code></pre><p>If the invalid index is suffixed with <code>_ccold</code>, it&#x2019;s the original index that wasn&#x2019;t fully replaced. You can safely drop it, as the rebuild has succeeded.</p><h3 id="create-partition-indexes-consistently"><strong>Create partition indexes consistently</strong></h3><p>For newly created partitioned tables or small ones (&lt;100k rows), you can simply create the index synchronously on the parent table, and it&apos;ll automatically propagate to all partitions, including any newly created ones in the future.</p><pre><code>CREATE INDEX idx_users_email ON users (email);
</code></pre><p>But it&#x2019;s currently not possible to use the <code>CONCURRENTLY</code> flag when creating an index on the root partitioned table. What you should use instead is the <code>ONLY</code> keyword. It tells Postgres not to apply the index recursively to child partitions, so they aren&#x2019;t locked.</p><pre><code class="language-sql">-- Create an index on the parent table (metadata-only operation)
CREATE INDEX idx_users_email ON ONLY users (email);
</code></pre><p>This creates an invalid index first. Then we can create indexes for each partition and attach them to the parent index:</p><pre><code class="language-sql">CREATE INDEX CONCURRENTLY idx_users_email_2019
    ON users_2019 (email);
ALTER INDEX idx_users_email
    ATTACH PARTITION idx_users_email_2019;

CREATE INDEX CONCURRENTLY idx_users_email_2020
    ON users_2020 (email);
ALTER INDEX idx_users_email
    ATTACH PARTITION idx_users_email_2020;

-- repeat for all partitions
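
-- Optional sanity check (hypothetical): the parent index becomes valid only
-- after every partition index has been attached.
SELECT indisvalid FROM pg_index WHERE indexrelid = &apos;idx_users_email&apos;::regclass;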
</code></pre><p>Only once all partition indexes are attached will the root table&#x2019;s index automatically be marked as valid. The parent itself is just a &#x201C;virtual&#x201D; table without any storage, but it serves to ensure all partitions maintain a consistent indexing strategy.</p><h3 id="check-the-query-execution-plan"><strong>Check the query execution plan</strong></h3><p>Using the <code>EXPLAIN ANALYZE</code> command provides a comprehensive view of the query execution plan, detailing how Postgres processes your query. This breakdown is essential for verifying that the expected indexes are being utilized effectively.</p><pre><code class="language-sql">EXPLAIN ANALYZE SELECT * FROM users_2019 WHERE email = &apos;arjun@bemi.io&apos;;

Index Scan using idx_users_email_2019 on users_2019  (cost=0.15..0.25 rows=1 width=48) (actual time=0.015..0.018 rows=1 loops=1)
  Index Cond: (email = &apos;arjun@bemi.io&apos;::text)
Planning Time: 0.123 ms
Execution Time: 0.028 ms
</code></pre><h3 id="remove-unused-indexes"><strong>Remove unused indexes</strong></h3><p>Sometimes the indexes we add aren&#x2019;t as valuable as expected. To prune our indexes to optimize write performance, we can check which indexes haven&#x2019;t been used:</p><pre><code class="language-sql">select 
    indexrelid::regclass as index, relid::regclass as table 
from 
    pg_stat_user_indexes 
    JOIN pg_index USING (indexrelid) 
where 
    idx_scan = 0 and indisunique is false;
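
-- Note: these statistics are tracked per server, so check replicas too
-- (an index unused on the primary may still serve reads elsewhere).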
</code></pre><p>By implementing these best practices, you can avoid scary mistakes. Remember to monitor, validate, and understand the implications of your indexing strategy. The cost of overlooking these details can be significant, and a proactive approach will help you maintain a stable and efficient database.</p><p><em>At </em><a href="https://bemi.io/?ref=blog.bemi.io"><em>Bemi</em></a><em>, we specialize in handling audit trails at large volumes, where storage optimization and the right indexing strategies are crucial. We have to deeply understand Postgres storage and indexing internals to ensure 100% reliability and performance. We&#x2019;ve had to build out index health monitoring at scale and also automated safeguards to ensure indexes are always valid and queries optimized. In a future blog, I&#x2019;ll share some of the internal performance tooling and tech we use under the hood.</em></p><p><em>But when Postgres indexing isn&apos;t enough to scale, check out the </em><a href="https://github.com/BemiHQ/BemiDB?ref=blog.bemi.io" rel="noopener"><em>BemiDB GitHub repo</em></a><em> for handling analytical workloads on Postgres.</em></p>]]></content:encoded></item><item><title><![CDATA[It’s Time to Rethink Event Sourcing]]></title><description><![CDATA[The traditional approach to implementing Event Sourcing comes with many challenges. 
In this blog post, I’ll share new ideas on how to achieve 80% of the Event Sourcing benefits with 20% effort.]]></description><link>https://blog.bemi.io/rethinking-event-sourcing/</link><guid isPermaLink="false">66d23304d9dfee00016e405e</guid><category><![CDATA[Engineering]]></category><dc:creator><![CDATA[Evgeny Li]]></dc:creator><pubDate>Tue, 03 Sep 2024 14:24:04 GMT</pubDate><media:content url="https://blog.bemi.io/content/images/2024/09/blog--7-.png" medium="image"/><content:encoded><![CDATA[<img src="https://blog.bemi.io/content/images/2024/09/blog--7-.png" alt="It&#x2019;s Time to Rethink Event Sourcing"><p>I&apos;ve always been fascinated by Event Sourcing (ES) and other Domain-Driven Design (DDD) concepts. At some point, I even built a prototype of an event-sourced system inspired by <a href="https://martinfowler.com/articles/lmax.html?ref=blog.bemi.io">LMAX</a>, a high-frequency trading platform that handles 6M orders per second.</p><p>Unfortunately, the traditional approach to implementing Event Sourcing comes with its own set of challenges. In this blog post, I&#x2019;ll share new ideas on how to achieve 80% of the Event Sourcing benefits with 20% effort.</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-text"><i><em class="italic" style="white-space: pre-wrap;">Event Sourcing is a unicorn idea that captivates many developers, but it is rarely adopted and implemented successfully.</em></i></div></div><h2 id="why-use-event-sourcing">Why Use Event Sourcing</h2><p>At its core, Event Sourcing is a simple architectural design pattern. All data changes are recorded as an immutable sequence of events in an append-only store, which becomes the main source of truth for application data. 
That&#x2019;s it.</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-text"><i><em class="italic" style="white-space: pre-wrap;">Event Sourcing is a simple, yet powerful concept.</em></i></div></div><p>This design pattern provides many advantages:</p><ul><li><strong>Data integrity</strong>. Unlike typical CRUD (Create/Read/Update/Delete) systems, stored events can&#x2019;t be modified to ensure data integrity.</li><li><strong>Auditability.</strong> The append-only store of events represents an audit trail that make it easy to track and audit changes.</li><li><strong>Traceability</strong>. Events contain the context such as the &#x2018;what&#x2019;, &#x2018;when&#x2019;, &#x2018;why&#x2019; and &#x2018;who&#x2019;, so you can easily trace and verify transactions.</li><li><strong>Compliance</strong>. Event store is a detailed log of all state changes, which is essential in regulated industries like finance, healthcare, etc.</li><li><strong>Rollbacks</strong>. If the current state is lost or corrupted, you can rebuild it by replaying the immutable events.</li><li><strong>Troubleshooting</strong>. The event store can be used for debugging and allows understanding the sequence of events leading to an issue.</li><li><strong>Time travel</strong>. Event Sourcing enables time travel capabilities by allowing you to reconstruct the previous state at any point in time.</li><li><strong>Enhanced analytics</strong>. It allows generating custom data representations (projections) to query historical data and identify patterns.</li><li><strong>Scalability and performance</strong>. 
Events can be handled asynchronously, which can improve performance and scalability.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.bemi.io/content/images/2024/09/blog--10-.png" class="kg-image" alt="It&#x2019;s Time to Rethink Event Sourcing" loading="lazy" width="2000" height="1334" srcset="https://blog.bemi.io/content/images/size/w600/2024/09/blog--10-.png 600w, https://blog.bemi.io/content/images/size/w1000/2024/09/blog--10-.png 1000w, https://blog.bemi.io/content/images/size/w1600/2024/09/blog--10-.png 1600w, https://blog.bemi.io/content/images/size/w2400/2024/09/blog--10-.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Traditional Event Sourcing system</span></figcaption></figure><h2 id="event-sourcing-examples">Event Sourcing Examples</h2><p>Most of us use existing event-sourced systems every day and can&#x2019;t imagine living without them.</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-text"><i><em class="italic" style="white-space: pre-wrap;">Git and bank ledger are frequently used Event Sourcing systems.</em></i></div></div><h3 id="bank-ledger-account"><strong>Bank ledger account</strong></h3><p>When you load information about your bank account, most online banks will show you recent ledger transactions, which represent event-sourced records of every money movement in your account.</p><p>The idea of recording ledgers as an Event Sourcing system was used way before computer systems were invented. 
Around 7000 years ago, ledgers were used to record lists of expenditures and goods traded on clay tablets, while temples were considered the banks of the time.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.bemi.io/content/images/2024/09/blog--11-.png" class="kg-image" alt="It&#x2019;s Time to Rethink Event Sourcing" loading="lazy" width="1349" height="900" srcset="https://blog.bemi.io/content/images/size/w600/2024/09/blog--11-.png 600w, https://blog.bemi.io/content/images/size/w1000/2024/09/blog--11-.png 1000w, https://blog.bemi.io/content/images/2024/09/blog--11-.png 1349w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Clay tablets as bank ledgers</span></figcaption></figure><h3 id="version-control-system">Version control system</h3><p>Version control systems, such as Git, are examples of Event Sourcing systems. Commits represent code changes that are recorded sequentially and become the main source of truth.</p><p>Additionally, commits record information about &#x2018;who&#x2019; made the change, &#x2018;when&#x2019; the change happened, and &#x2018;why&#x2019; it was made via a commit message.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.bemi.io/content/images/2024/09/Table--1-.png" class="kg-image" alt="It&#x2019;s Time to Rethink Event Sourcing" loading="lazy" width="2000" height="752" srcset="https://blog.bemi.io/content/images/size/w600/2024/09/Table--1-.png 600w, https://blog.bemi.io/content/images/size/w1000/2024/09/Table--1-.png 1000w, https://blog.bemi.io/content/images/size/w1600/2024/09/Table--1-.png 1600w, https://blog.bemi.io/content/images/size/w2400/2024/09/Table--1-.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Git as an Event Sourcing concept</span></figcaption></figure><p>This means that you can view a history of all changes, time travel by checking out a previous commit, roll back changes, 
troubleshoot issues by using a binary search, analyze code changes, and so on. You get the idea.</p><h2 id="issues-with-traditional-event-sourcing">Issues with Traditional Event Sourcing</h2><p>While Event Sourcing has many benefits, it also comes with many disadvantages that prevent it from being adopted more widely.</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-text"><i><em class="italic" style="white-space: pre-wrap;">Event Sourcing is a simple idea that is very hard to implement.</em></i></div></div><ul><li><strong>Big paradigm shift</strong>. It is a fundamentally different approach to data management that goes against commonly used techniques such as RESTful APIs and UPDATE/DELETE database operations.</li><li><strong>Extra dependencies</strong>. Implementing Event Sourcing usually requires introducing additional concepts, such as CQRS (Command and Query Responsibility Segregation), which need to handle events, take snapshots, and rebuild projections.</li><li><strong>Steep learning curve</strong>. Event Sourcing introduces new concepts and patterns that developers might not be familiar with, which can require additional time to adapt to the event-centric approach and learn new tools.</li><li><strong>Eventual consistency</strong>. Event processing at scale is generally done asynchronously, which requires rethinking how data is being accessed. For example, when a user submits a multi-step form, you won&#x2019;t be able to show a summary with all saved information and will be required to just show a confirmation page in the UI instead.</li><li><strong>Event versioning</strong>. As your system evolves, you&#x2019;d need to change the format of your events to, for example, start storing additional information. And because events are immutable, you&#x2019;d need strategies for maintaining backwards compatibility or migrating old events.</li><li><strong>Storage and compute needs</strong>. 
Since all events are stored and never deleted, it requires more storage and compute resources, which typically involves implementing an event streaming system.</li><li><strong>Expensive migration</strong>. If you have a large non-event-sourced system, transitioning to Event Sourcing can be a significant undertaking that requires changing almost the entire codebase, backfilling past events, and careful testing.</li><li><strong>Upfront cost.</strong> There is lots of literature on Event Sourcing, but there are no universal and flexible frameworks that can work with any tech stack. That&#x2019;s why most teams DIY and implement all the additional code plumbing around commands, command handlers, validators, aggregates, and so on themselves.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.bemi.io/content/images/2024/09/blog--12-.png" class="kg-image" alt="It&#x2019;s Time to Rethink Event Sourcing" loading="lazy" width="2000" height="1334" srcset="https://blog.bemi.io/content/images/size/w600/2024/09/blog--12-.png 600w, https://blog.bemi.io/content/images/size/w1000/2024/09/blog--12-.png 1000w, https://blog.bemi.io/content/images/size/w1600/2024/09/blog--12-.png 1600w, https://blog.bemi.io/content/images/size/w2400/2024/09/blog--12-.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Developer productivity over time with a CRUD system vs Event Sourcing</span></figcaption></figure><p>Is there a way to get most of the Event Sourcing benefits while avoiding its disadvantages?</p><h2 id="the-new-approach-to-event-sourcing">The New Approach to Event Sourcing</h2><p>The disadvantages of Event Sourcing listed above make it a complete nonstarter for most companies. 
Let&apos;s reconsider the traditional Event Sourcing approach by taking a closer look at how we use a version control system like Git.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.bemi.io/content/images/2024/09/Table-1.png" class="kg-image" alt="It&#x2019;s Time to Rethink Event Sourcing" loading="lazy" width="2000" height="934" srcset="https://blog.bemi.io/content/images/size/w600/2024/09/Table-1.png 600w, https://blog.bemi.io/content/images/size/w1000/2024/09/Table-1.png 1000w, https://blog.bemi.io/content/images/size/w1600/2024/09/Table-1.png 1600w, https://blog.bemi.io/content/images/size/w2400/2024/09/Table-1.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Rethinking traditional Event Sourcing through the lens of a version control system</span></figcaption></figure><p>As you can see, with Git, we:</p><ul><li>Reuse an existing tool for different projects without reinventing the wheel.</li><li>Continue editing and working with mutable code files instead of thinking about how to construct diffs.</li><li>Occasionally contextualize and wrap changes into commits, a higher-level standardized data abstraction.</li><li>Get automatic state reconstruction that supports time traveling and rollbacks, plus a full audit log.</li></ul><p>We can&#x2019;t, however, blindly copy the Git model and apply it to build &#x201C;Git for data&#x201D;. The main reason is that Git commits are usually committed manually by developers, while data in applications is frequently changed automatically. Instead, we need to use a slightly different approach.</p><h3 id="change-data-capture-and-its-limitations">Change Data Capture, and its limitations</h3><p>Change Data Capture (CDC) is a design pattern used to identify and capture changes made to data in a database in real time. 
For example, when moving data from an online transaction processing (OLTP) database like PostgreSQL to an online analytical processing (OLAP) system like Snowflake, people typically use CDC to ingest changes and record them in a data warehouse.</p><figure class="kg-card kg-code-card"><pre><code class="language-json">{
   &quot;table&quot;: &quot;shopping_cart_items&quot;,
   &quot;primary_key&quot;: 1,
   &quot;operation&quot;: &quot;UPDATE&quot;,
   &quot;committed_at&quot;: &quot;2024-09-01 17:09:15+00&quot;,
   &quot;before&quot;:{
      &quot;id&quot;: 1,
      &quot;quantity&quot;: 1,
      ...
   },
   &quot;after&quot;:{
      &quot;id&quot;: 1,
      &quot;quantity&quot;: 2,
      ...
   }
}</code></pre><figcaption><p><span style="white-space: pre-wrap;">Captured change</span></p></figcaption></figure><p>We could continue performing CRUD operations in a regular database (which behaves like the latest snapshot) without rearchitecting our application, and use CDC to capture all data changes in the background and store them as an immutable audit log (which behaves like an event store).</p><p>There is, however, one big fundamental difference between Event Sourcing and Change Data Capture:</p><ul><li><strong>Event Sourcing</strong>: Events reflect domain-related processes that happened at the application level. For example, &#x201C;shopping item quantity increased&#x201D;.</li><li><strong>Change Data Capture</strong>: Changes reflect low-level data changes. For example, &#x201C;a database row in a <code>shopping_cart_items</code> table with an ID <code>1</code> was updated&#x201D;.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.bemi.io/content/images/2024/09/Table--1--1.png" class="kg-image" alt="It&#x2019;s Time to Rethink Event Sourcing" loading="lazy" width="2000" height="934" srcset="https://blog.bemi.io/content/images/size/w600/2024/09/Table--1--1.png 600w, https://blog.bemi.io/content/images/size/w1000/2024/09/Table--1--1.png 1000w, https://blog.bemi.io/content/images/size/w1600/2024/09/Table--1--1.png 1600w, https://blog.bemi.io/content/images/size/w2400/2024/09/Table--1--1.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Similarities between CDC and version control systems</span></figcaption></figure><p>To bridge the gap and make database changes captured with CDC meaningful and consistent, we can use a couple of different approaches.</p><h3 id="approach-1-outbox-pattern-with-change-data-capture">Approach 1: Outbox pattern with Change Data Capture</h3><p>The Outbox pattern allows you to atomically update data in a database and record messages that need to be sent in order to 
guarantee data consistency.</p><p>When performing regular database record changes, we can also insert event records into an &#x201C;ephemeral&#x201D; outbox table within the same transaction:</p><figure class="kg-card kg-code-card"><pre><code class="language-sql">BEGIN;
  UPDATE shopping_cart_items SET quantity = 2 WHERE id = 1;
  UPDATE products SET in_stock_count = in_stock_count - 1 WHERE id = 123;
  INSERT INTO outbox_events (event_type, entity_type, entity_id, payload) VALUES (...);
COMMIT;
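
-- For the example above, the outbox row might look like this
-- (event names and payload shape are illustrative):
-- INSERT INTO outbox_events (event_type, entity_type, entity_id, payload)
-- VALUES (&apos;SHOPPING_CART_ITEM_QUANTITY_UPDATED&apos;, &apos;SHOPPING_CART_ITEM&apos;, 1, &apos;{&quot;quantity&quot;: 2}&apos;);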
</code></pre><figcaption><p><span style="white-space: pre-wrap;">Inserting events using the Outbox pattern</span></p></figcaption></figure><p>After the transaction completes, the domain-specific events can be reliably captured by CDC and permanently stored in an event store, similar to a traditional Event Sourcing approach.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.bemi.io/content/images/2024/09/blog--13-.png" class="kg-image" alt="It&#x2019;s Time to Rethink Event Sourcing" loading="lazy" width="2000" height="1334" srcset="https://blog.bemi.io/content/images/size/w600/2024/09/blog--13-.png 600w, https://blog.bemi.io/content/images/size/w1000/2024/09/blog--13-.png 1000w, https://blog.bemi.io/content/images/size/w1600/2024/09/blog--13-.png 1600w, https://blog.bemi.io/content/images/size/w2400/2024/09/blog--13-.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Event Sourcing using the Outbox pattern and Change Data Capture</span></figcaption></figure><p>With this approach, we get the simplicity of a typical CRUD system and the benefits of an immutable and consistent append-only event store derived from data changes with CDC.</p><h3 id="approach-2-contextualized-change-data-capture">Approach 2: Contextualized Change Data Capture</h3><p>Another slightly simplified and more practical approach is to contextualize data changes in CDC pipelines without making any modifications to the underlying data structures and database queries.</p><p>With a database like PostgreSQL, it&#x2019;s possible to pass additional context with queries that is visible only to a CDC system. Here is a simple code example written in JavaScript using Prisma ORM:</p><pre><code class="language-js">setContext({
  // Event-related data
  eventType: &apos;SHOPPING_CART_ITEM_QUANTITY_UPDATED&apos;,
  entityType: &apos;SHOPPING_CART_ITEM&apos;,
  entityId: 1,
  quantity: 2,
  // Additional context
  userId: currentUser.id,
  apiEndpoint: req.url,
});
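// Note: setContext is not a Prisma built-in. It stands in for an ORM-level
// helper (for example, one provided by a CDC integration library) that
// attaches the metadata above to the queries below, so the capture pipeline
// can stitch the context and the data changes back together.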

await prisma.shoppingCartItem.update({
  where: { id: 1 },
  data: { quantity: 2 },
});
await prisma.products.update({
  where: { id: 123 },
  data: { inStockCount: product.inStockCount - 1 },
});</code></pre><p>After the changes are committed to the database, we can reliably capture them, stitch them together with the context, and store them as audit trail records.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.bemi.io/content/images/2024/09/blog--14-.png" class="kg-image" alt="It&#x2019;s Time to Rethink Event Sourcing" loading="lazy" width="2000" height="1334" srcset="https://blog.bemi.io/content/images/size/w600/2024/09/blog--14-.png 600w, https://blog.bemi.io/content/images/size/w1000/2024/09/blog--14-.png 1000w, https://blog.bemi.io/content/images/size/w1600/2024/09/blog--14-.png 1600w, https://blog.bemi.io/content/images/size/w2400/2024/09/blog--14-.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Event Sourcing using Change Data Capture and data change contextualization</span></figcaption></figure><p>With this approach, we can continue using CRUD operations and store all event data as context in an immutable and reliable audit trail. This allows us, for example, to query all events for a particular &#x201C;Shopping Cart Item&#x201D; and see all underlying data changes made as part of these events.</p><h2 id="conclusion">Conclusion</h2><p>It&#x2019;s time to rethink Event Sourcing and stop trying to reinvent the wheel every time we want to implement it in our applications.</p><p>In some regulated industries, like accounting, there are already well-established industry standards for using Event Sourcing in the form of a double-entry bookkeeping system, such as a General Ledger.</p><p>In 95% of other cases, you can get most of the Event Sourcing benefits by using Change Data Capture enriched with your domain-specific information. The Change Data Capture design pattern allows you to reliably track and record all data changes made in a database. 
This, in combination with the Outbox pattern or data change contextualization implemented in the application, allows you to achieve the Event Sourcing advantages mentioned at the beginning of this blog post.</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-text"><i><em class="italic" style="white-space: pre-wrap;">It is possible to event-source any system by implementing Change Data Capture and enriching it with domain-specific information.</em></i></div></div><p>This essentially flips the paradigm and allows deriving an immutable log of domain-specific events from regular database changes. Note that the described approaches are not meant to replace the business layer in your application. You still need to think about your domain design and implement it in your code.</p><hr><h2 id="about-us">About us</h2><p>If you need help with event-sourcing your system, check out <a href="https://bemi.io/?ref=blog-es" rel="noreferrer">Bemi</a>. Our solution can help you enable automatic data change tracking for your database in a few minutes, integrate it with your ORM for data change contextualization, and have a full audit trail automatically stored in a serverless PostgreSQL database.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.bemi.io/content/images/2024/09/Table--2--1.png" class="kg-image" alt="It&#x2019;s Time to Rethink Event Sourcing" loading="lazy" width="2000" height="1242" srcset="https://blog.bemi.io/content/images/size/w600/2024/09/Table--2--1.png 600w, https://blog.bemi.io/content/images/size/w1000/2024/09/Table--2--1.png 1000w, https://blog.bemi.io/content/images/size/w1600/2024/09/Table--2--1.png 1600w, https://blog.bemi.io/content/images/size/w2400/2024/09/Table--2--1.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Event Sourcing via CDC vs traditional Event Sourcing</span></figcaption></figure><p><em>For scaling a centralized Postgres data 
store, check out the </em><a href="https://github.com/BemiHQ/BemiDB?ref=blog.bemi.io"><em>BemiDB Github repo</em></a><em>.</em></p>]]></content:encoded></item><item><title><![CDATA[Bemi achieves SOC 2 compliance]]></title><description><![CDATA[<p>At Bemi, security and reliability have always been at the core of what we do. Long before we even considered a SOC 2 audit, we built our systems with security, encryption protocols, and processes that went well beyond the requirements. Here are some of the <a href="https://bemi.io/security?ref=blog.bemi.io">security features</a> Bemi already had</p>]]></description><link>https://blog.bemi.io/soc2/</link><guid isPermaLink="false">66ce30b5d9dfee00016e4038</guid><category><![CDATA[Announcement]]></category><dc:creator><![CDATA[Arjun Lall]]></dc:creator><pubDate>Wed, 28 Aug 2024 01:07:04 GMT</pubDate><media:content url="https://blog.bemi.io/content/images/2024/08/security--3-.png" medium="image"/><content:encoded><![CDATA[<img src="https://blog.bemi.io/content/images/2024/08/security--3-.png" alt="Bemi achieves SOC 2 compliance"><p>At Bemi, security and reliability have always been at the core of what we do. Long before we even considered a SOC 2 audit, we built our systems with security, encryption protocols, and processes that went well beyond the requirements. 
Here are some of the <a href="https://bemi.io/security?ref=blog.bemi.io">security features</a> Bemi already had in place:</p><ul><li>AES-256 storage encryption at rest</li><li>TLS in-transit encryption to protect database traffic</li><li>HTTPS in-transit encryption to encrypt all web traffic</li><li>Customers&#x2019; credentials protected with military-grade encryption algorithms</li><li>Restricted IP access rules and password credentials for destination databases</li><li>Static Bemi IPs for allowlisting a connection to source databases</li><li>Isolated internal network SSH tunnelling with certification encryption</li><li>Data and container level customer isolation</li><li>Monitoring and alerting at all stack layers</li><li>Continuous software vulnerability scanning</li></ul><p>As we grew, we realized that transparency is just as important as having strong security in place. And for many of our customers, especially those with stringent legal and security requirements, an external audit is a crucial part of building that trust.</p><h2 id="why-we-pursued-soc-2-now"><strong>Why We Pursued SOC 2 Now</strong></h2><p>SOC 2 or Service Organization Controls 2 is a framework governed by the American Institute of Certified Public Accountants (AICPA). With a SOC 2 audit, an independent service auditor will review an organization&#x2019;s policies, procedures, and evidence to determine if their controls are designed and operating effectively. A SOC 2 report communicates a company&#x2019;s commitment to data security and protection of customer information.</p><p>We decided to pursue SOC 2 compliance because we wanted to make our commitment to security as clear as possible. We&#x2019;ve always been open about our processes&#x2014;just a few months ago, we open-sourced our codebase to give everyone a closer look at what we&#x2019;ve built. 
In the same spirit of transparency, we recognized that an external SOC 2 audit would provide the additional assurance that larger companies&#x2019; legal and security teams look for. It&#x2019;s another step in our ongoing investment in trust.</p><h2 id="our-journey-to-soc-2-certification"><strong>Our Journey to SOC 2 Certification</strong></h2><p>We partnered with <a href="https://www.vanta.com/?ref=blog.bemi.io">Vanta</a>, the leader in Trust Management, to automate the collection of our audit evidence. Vanta provides us with the strongest security foundation to protect our customer data.</p><p>Our audit firm, <a href="https://advantage-partners.com/?ref=blog.bemi.io">Advantage Partners</a>, then stepped in to assess our controls. For the audit, Advantage Partners evaluated the controls we have in place and opined on their state. Shortly after our audit window ended, Advantage Partners drafted and issued our report.</p><p>While SOC 2 can be a big undertaking, our compliance partners greatly streamlined the process. The readiness period can take the most time but we were able to make compliance a priority to get audit ready in a matter of weeks versus months.</p><p>We also found it important to review the audit timeline with Advantage Partners, set an ideal audit date, and then work backwards to be ready in time. Now that controls are implemented, subsequent SOC 2 audits will be even more seamless.</p><h2 id="lessons-we-learned"><strong>Lessons We Learned</strong></h2><h3 id="focus-on-improving-security-posture-not-checking-boxes"><strong>Focus on Improving Security Posture, Not Checking Boxes</strong></h3><p>Compliance isn&#x2019;t a one-size-fits-all approach. It&#x2019;s about continually improving security, not just meeting the minimum requirements. 
At Bemi, we&#x2019;ve always seen security as an ongoing project, something that&#x2019;s woven into the fabric of our company.</p><h3 id="start-the-process-early"><strong>Start the Process Early</strong></h3><p>Implementing security measures is easier when you start early. We&#x2019;ve always prioritized building secure infrastructure, which made our SOC 2 journey smoother. By embedding security in our processes from day one, we were able to meet SOC 2 standards without needing to overhaul our systems.</p><h3 id="security-and-compliance-help-scale-your-business"><strong>Security and Compliance Help Scale Your Business</strong></h3><p>SOC 2 compliance isn&#x2019;t just about security&#x2014;it&#x2019;s also a business enabler. Many of our larger customers require vendor security reviews as part of their procurement process. With our SOC 2 report, we can move through these reviews more quickly, allowing us to scale faster and with greater trust.</p><h3 id="the-right-partners-are-key"><strong>The Right Partners Are Key</strong></h3><p>Choosing the right tools and audit partners is crucial. Vanta and Advantage Partners helped us navigate the SOC 2 process efficiently. Their expertise ensured that our journey to compliance was seamless, saving us time and effort.</p><h2 id="looking-ahead"><strong>Looking Ahead</strong></h2><p>We&#x2019;re proud of what we&#x2019;ve achieved, but this is just one step in our ongoing commitment to security. As we continue to grow, we&#x2019;ll keep investing in the tools and processes that protect our customers and build trust. Achieving SOC 2 compliance is an important milestone, but it&apos;s part of a broader mission. We were already HIPAA compliant, ensuring that we meet strict standards for healthcare data protection. 
Moving forward, we&apos;ll continue to prioritize security and transparency, making Bemi a company you can rely on&#x2014;both now and in the future.</p>]]></content:encoded></item><item><title><![CDATA[How KLog Saved $200,000 by Switching to Bemi Audit Trails]]></title><description><![CDATA[Learn about the engineering challenges faced at KLog in reliably streaming data changes internally and creating a powerful audit trail.]]></description><link>https://blog.bemi.io/case-study-klog/</link><guid isPermaLink="false">668f723ad9dfee00016e3f59</guid><category><![CDATA[Case Study]]></category><dc:creator><![CDATA[Arjun Lall]]></dc:creator><pubDate>Mon, 15 Jul 2024 06:00:00 GMT</pubDate><media:content url="https://blog.bemi.io/content/images/2024/07/Frame-2608240--3-.png" medium="image"/><content:encoded><![CDATA[<img src="https://blog.bemi.io/content/images/2024/07/Frame-2608240--3-.png" alt="How KLog Saved $200,000 by Switching to Bemi Audit Trails"><p><a href="https://klog.co/?ref=blog.bemi.io" rel="noreferrer">KLog</a> democratizes international cargo transportation with an intuitive digital platform. In addition to moving cargo, KLog offers a comprehensive solution for more efficient and hassle-free logistics management. With over 5,000 customers, including Hugo Boss, Crocs, Kensington, and Wrangler, KLog has become the de facto logtech in Latin America and around the world.</p><h3 id="klog-was-spending-upwards-of-200k-in-engineering-resources-to-track-data-changes-internally">KLog was spending upwards of $200k in engineering resources to track data changes internally</h3><p>KLog required a highly reliable data audit trail that included context such as the user ID, API endpoint, request payload, and GraphQL mutation name behind a change.</p><p>The company initially built a solution internally through a cross-collaborative effort between software and data engineering teams. 
The system involved first adding an <code>updatedBy</code> field on every Postgres table and creating an application layer middleware that set this field on every data change in serverless functions. From the AWS RDS instances, their team then generated logs using an AWS DMS task and, due to the sheer volume, attempted to store the data in Parquet files for post-processing in S3 buckets. An alternative approach companies consider when building DIY is using Debezium for change data capture. </p><p>The system eventually encountered breakdowns when it came to reading the data, ensuring consistent application context, and maintenance.</p><blockquote>The DIY system got so complex that developers mentioned they were &apos;allergic&apos; to that part of the codebase. There could have been lost upmarket deals since it took months to initially build.<br>&#x2014; &#xC1;lvaro Serrano, CTO KLog</blockquote><h3 id="klog-switched-to-bemi-and-never-looked-back">KLog switched to Bemi and never looked back</h3><p>KLog easily connected their Postgres databases to the Bemi platform and used the <a href="https://github.com/BemiHQ/bemi-prisma?ref=blog.bemi.io"><u>Prisma ORM integration</u></a> to automatically handle the enrichment, structuring, and formatting of the lower-level database events from the Write-Ahead Logs. KLog was then able to use Bemi&#x2019;s intuitive control plane UI to significantly reduce data volumes by tracking only relevant data changes through column and table-level filtering.</p><p>Due to the high reliability and accuracy of the platform, Bemi became the 100% source of truth at KLog. Beyond just an audit trail, KLog&#x2019;s more than 40 customer success and operations team members use the Bemi UI multiple times a day as the definitive, immutable source of truth when troubleshooting shipment updates.</p><blockquote>We love working with the Bemi team. 
Their customer service is incredible &#x2014; responsive, knowledgeable, and always willing to go the extra mile. They&#x2019;re a talented team, and we have full confidence in their expertise.<br>&#x2014; &#xC1;lvaro Serrano, CTO KLog</blockquote><h3 id="future-plans">Future plans</h3><p>Since the Bemi data is also easily consumable with ORM-specific libraries, KLog plans to later build products centered around the data, such as customer shipment feeds. With more internal AI applications injecting data into KLog&#x2019;s platform, their operations teams are becoming auditors of data rather than data inputters, thanks to Bemi. Bemi plans to extend ORM integrations with functionality to consume data change events in real-time, allowing KLog to reliably also power their notifications, AI RAG system, and microservice communication in the future.</p><blockquote>Bemi has been a game-changer for us, highly recommend!! We&#x2019;re not in the business of tracking data changes and are now able to concentrate fully on our core logistics product.<br>&#x2014; &#xC1;lvaro Serrano, CTO KLog</blockquote><h3 id="try-out-bemi">Try out Bemi</h3><p>If you want to use Bemi to track Postgres data changes, star <a href="https://github.com/BemiHQ/bemi?ref=blog.bemi.io">Bemi on GitHub</a> and try <a href="https://dashboard.bemi.io/?ref=blog.bemi.io"><u>Bemi Cloud</u></a> for free.</p>]]></content:encoded></item><item><title><![CDATA[Choosing the Right Audit Trail Approach in Ruby]]></title><description><![CDATA[The Ruby ecosystem offers a wide range of tools for building an audit trail, each with its pros and cons. 
So, which one is the best choice?]]></description><link>https://blog.bemi.io/audit-trail-in-ruby/</link><guid isPermaLink="false">6632b99484ba3700018c7f9a</guid><category><![CDATA[Engineering]]></category><category><![CDATA[Guide]]></category><dc:creator><![CDATA[Evgeny Li]]></dc:creator><pubDate>Wed, 01 May 2024 22:13:12 GMT</pubDate><media:content url="https://blog.bemi.io/content/images/2024/05/Audit-Trail-1.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://blog.bemi.io/content/images/2024/05/Audit-Trail-1.jpg" alt="Choosing the Right Audit Trail Approach in&#xA0;Ruby"><p>Ruby gems such as <a href="https://github.com/paper-trail-gem/paper_trail?ref=blog.bemi.io" rel="noopener">PaperTrail</a> and <a href="https://github.com/collectiveidea/audited?ref=blog.bemi.io" rel="noopener">Audited</a> have been downloaded over a hundred million times and are becoming table stakes in many applications. The Ruby ecosystem offers a wide range of useful tools for building an audit trail, each with its respective pros and cons.</p><h3 id="what-is-an-audit-trail">What is an audit&#xA0;trail?</h3><p>An audit trail (audit log) is a chronological set of records representing documentary evidence of system activities. 
There are many use cases and benefits to having an audit trail, here are some examples:</p><ul><li><strong>Disaster recovery</strong>: selectively find and restore historical records</li><li><strong>Customer observability</strong>: save time tracking customer activity</li><li><strong>Regulatory compliance</strong>: track data access and simplify audits</li><li><strong>Fraud detection</strong>: identify fraudulent or malicious user activity</li><li><strong>Enterprise table stakes</strong>: allow monitoring activity in an organization</li><li><strong>Engineering on-call</strong>: quickly understand reasons behind data changes</li></ul><h3 id="ruby-audit-trail-solutions">Ruby audit trail solutions</h3><p>Let&#x2019;s explore and compare the following approaches to building an audit trail and decide which one of these to choose:</p><ul><li><strong>Callback-based solutions</strong>: PaperTrail, Audited, Mongoid History</li><li><strong>Trigger-based solutions</strong>: Logidze</li><li><strong>Replication log-based solutions</strong>: Bemi Rails</li><li><strong>Manual tracking</strong>: PublicActivity, Ahoy</li><li><strong>Console command logging</strong>: Console1984, Audits1984</li><li><strong>Custom logging</strong>: Marginalia</li></ul><hr><h2 id="callback-based-solutions">Callback-Based Solutions</h2><figure class="kg-card kg-image-card"><img src="https://cdn-images-1.medium.com/max/1600/1*uB_AzdPexdjAku29iDXXPQ.png" class="kg-image" alt="Choosing the Right Audit Trail Approach in&#xA0;Ruby" loading="lazy" width="2000" height="720"></figure><p><a href="https://github.com/paper-trail-gem/paper_trail?ref=blog.bemi.io" rel="noopener"><strong>PaperTrail</strong></a> and<strong> </strong><a href="https://github.com/collectiveidea/audited?ref=blog.bemi.io" rel="noopener"><strong>Audited</strong></a> are very popular gems that integrate with the ActiveRecord object-relational mapper (ORM) by using model callbacks to allow auditing data changes.</p><p>When a record is 
created, updated, or deleted, they insert an additional record that stores the changes in a single audit table. This table stores the before/after state in JSON or JSONB format and a reference pointing to the original record.</p><p>This approach is implemented purely at the application level and can be easily enabled for any ActiveRecord-supported database such as PostgreSQL, MySQL, or SQLite.</p><pre><code class="language-ruby">class MyModel &lt; ApplicationRecord
  has_paper_trail
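
  # Once enabled, every create/update/destroy callback writes a
  # PaperTrail::Version row; history is then readable via my_model.versions,
  # and versions.last.reify rebuilds the previous state. Callback-skipping
  # methods such as update_column are not captured.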
end</code></pre><p>For MongoDB, <a href="https://github.com/mongoid/mongoid-history?ref=blog.bemi.io" rel="noopener"><strong>Mongoid History</strong></a> gem works similarly and integrates with Mongoid, the officially supported object-document mapper (ODM).</p><p>They also allow passing and storing application-specific context with changes, such as a user who performed the changes or an API request where the changes were triggered:</p><pre><code class="language-ruby"># User
Audited.audit_class.as_user(current_user) do
  # Additional context
  audit_comment = { endpoint: &quot;#{request.method} #{request.path}&quot; }.to_json

  my_record.update!(published: true, audit_comment: audit_comment)
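
  # Audited stores the acting user and the comment string on the resulting
  # audit row in the audits table, alongside the before/after values.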
end</code></pre><h3 id="pros">Pros</h3><ul><li><strong>Easy to get started</strong>. An audit trail can be enabled by installing a gem and configuring it with a few lines of code.</li><li><strong>Customization</strong>. It&#x2019;s possible, for example, to use custom serializers for the before/after state or add a complex condition for disabling tracking.</li></ul><h3 id="cons">Cons</h3><ul><li><strong>Reliability and accuracy</strong>. Many ActiveRecord methods such as <code>delete</code>, <code>update_column</code>, <code>update_all</code>, <code>delete_all</code>, and so on don&#x2019;t trigger callbacks. Thus, changes produced by these methods can&#x2019;t be tracked. Additionally, inserting data changes does not always happen atomically, which may lead to data loss and inconsistency if, for example, there is a network issue.</li><li><strong>Performance</strong>. The database workload increases by roughly 2x because each single record change produces an additional database query that inserts an audit record. This affects the application and database performance.</li><li><strong>Scalability</strong>. A single audit table can get very large. I&#x2019;ve seen cases where such tables ran out of integers used for primary keys. 
A large table makes it harder to manage and query at scale while also significantly increasing database resource usage and costs.</li></ul><h2 id="trigger-based-solutions">Trigger-Based Solutions</h2><figure class="kg-card kg-image-card"><img src="https://cdn-images-1.medium.com/max/1600/1*j5ozkmkbtEpHAh1yt3c0LQ.png" class="kg-image" alt="Choosing the Right Audit Trail Approach in&#xA0;Ruby" loading="lazy" width="2000" height="720"></figure><p><a href="https://github.com/palkan/logidze?ref=blog.bemi.io" rel="noopener"><strong>Logidze</strong></a> leverages the PostgreSQL triggers functionality and creates a new <code>log_data</code> JSONB column in each auditable table.</p><p>When a record is created or updated, PostgreSQL executes a row-based trigger which takes the current values of the record and appends them in the <code>log_data</code> column in a separate SQL query within the same transaction behind the scenes. Here is an example of the <code>log_data</code>:</p><pre><code class="language-json">{
  &quot;v&quot;: 2, // current record version
  &quot;h&quot;: [  // list of changes
    {
      &quot;v&quot;: 1,                          // change version
      &quot;ts&quot;: 1460805759352,             // change timestamp
      &quot;c&quot;: { &quot;published&quot;: false },     // new values
      &quot;m&quot;: {
        &quot;_r&quot;: 42,                      // User ID
        &quot;endpoint&quot;: &quot;POST /my_records&quot; // Additional context
      }
    },
    ...
  ]
}</code></pre><p>It also allows passing and storing application-specific context with ActiveRecord changes, for example:</p><pre><code class="language-ruby"># User ID
Logidze.with_responsible(current_user.id) do
  my_record.update!(published: true)
end
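# The recorded history can later be read back through the Logidze model API,
# for example with my_record.log_data, my_record.at(version: 1), or
# my_record.diff_from(version: 1).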

# Additional context
Logidze.with_meta({ endpoint: &quot;#{request.method} #{request.path}&quot; }) do
  my_record.update!(published: true)
end</code></pre><h3 id="pros-1">Pros</h3><ul><li><strong>Improved performance</strong>. It can be almost 100% faster than callback-based solutions for record inserts and about 10% faster for updates. It still makes additional database queries on each record change, but they&#x2019;re triggered on the database level skipping ActiveRecord.</li></ul><h3 id="cons-1">Cons</h3><ul><li><strong>No delete tracking</strong>. Because audit logs are attached and stored with an original record, deleting the record will lead to losing its entire change history. To overcome this limitation, you can use callback-based solutions designed for marking records as soft-deleted and ignoring them when querying, such as <a href="https://github.com/rubysherpas/paranoia?ref=blog.bemi.io" rel="noopener">Paranoia</a>, <a href="https://github.com/jhawthorn/discard?ref=blog.bemi.io" rel="noopener">Discard</a>, or <a href="https://github.com/ActsAsParanoid/acts_as_paranoid?ref=blog.bemi.io" rel="noopener">ActsAsParanoid</a>. But if you decide to use them, be careful and make sure to read our <a href="https://blog.bemi.io/soft-deleting-chaos/" rel="noopener">blog post</a> about their danger and how they can lead to critical incidents.</li><li><strong>Data structure</strong>. The data structure can be hard to work with and query directly. The field names in JSON are shortened to 1-2 characters to save disk space but this worsens the readability. Selecting records with the included JSON can significantly decrease database query performance because of <a href="https://wiki.postgresql.org/wiki/TOAST?ref=blog.bemi.io" rel="noopener">PostgreSQL TOAST</a>, so be careful with <code>SELECT *&#xA0;...</code> SQL statements. It&#x2019;s also difficult to construct, for example, a timeline of all changes across multiple records without knowing and fetching them in advance.</li><li><strong>Complexity</strong>. 
Understanding and changing the code for complex triggers with hundreds of lines in SQL can be challenging. Just a single mistake in an SQL function can break all queries. The context passing can also be tricky. For example, if you use PostgreSQL with a connection pooler such as PgBouncer, you need to wrap your queries into a transaction because Logidze relies on <a href="https://www.postgresql.org/docs/current/sql-set.htm?ref=blog.bemi.io" rel="noopener">PostgreSQL local parameters</a>. But at the same time, if you use transactions, it&#x2019;s impossible to pass application context to changes that are triggered after &#x201C;commit&#x201D; Rails callbacks.</li></ul><h2 id="replication-log-based-solutions">Replication Log-Based Solutions</h2><figure class="kg-card kg-image-card"><img src="https://cdn-images-1.medium.com/max/1600/1*i3FrMjpJktnx1PoG5vDtYg.png" class="kg-image" alt="Choosing the Right Audit Trail Approach in&#xA0;Ruby" loading="lazy" width="2000" height="720"></figure><p><a href="https://github.com/BemiHQ/bemi-rails?ref=blog.bemi.io" rel="noopener"><strong>Bemi Rails</strong></a> uses the native PostgreSQL replication log called <a href="https://www.postgresql.org/docs/current/wal-intro.html?ref=blog.bemi.io" rel="noopener">Write-Ahead Log (WAL)</a> which records all changes before they are flushed on a disk.&#xA0;</p><p>Traditionally, the PostgreSQL WAL is used for data recovery after a database crash by replaying records or replicating changes to standby read replicas. Bemi uses the same functionality:</p><ul><li><a href="https://github.com/BemiHQ/bemi?ref=blog.bemi.io" rel="noopener">Bemi Core</a> connects to the PostgreSQL WAL like a standby replica. 
It ingests and logically decodes all changes asynchronously and then stores them with the before/after states.</li><li>Bemi Rails allows setting the application context and passing it directly to the WAL with data changes without the need to update the database structure.</li></ul><pre><code class="language-ruby"># Custom context
Bemi.set_context(
  user_id: current_user.id,
  endpoint: &quot;#{request.method} #{request.path}&quot;,
)
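# (The context above is serialized and passed into the WAL alongside the
# transaction rather than stored as regular table data.)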

# Data change that will be passed with the context into the PostgreSQL WAL
my_record.update!(published: true)</code></pre><h3 id="pros-2">Pros</h3><ul><li><strong>Reliability and accuracy</strong>. The PostgreSQL WAL is the source of truth for all data changes. Data changes will be captured even when they are produced by executing a direct SQL query within or outside the application.</li><li><strong>Performance</strong>. Audit logs are ingested asynchronously without affecting runtime performance. The application context is passed to the WAL directly from the application, but it has a minimal performance impact since it doesn&#x2019;t get stored and processed as regular PostgreSQL data.</li></ul><h3 id="cons-2">Cons</h3><ul><li><strong>Infrastructure complexity</strong>. Ingesting logically decoded changes requires running a separate worker process that connects to the database&#x2019;s replication log. This can be similar to or even more challenging than trying to run a self-managed database replica instance in a cluster. For example, this solution requires creating a replication slot and maintaining the ingested position in the WAL, implementing heartbeats, ingesting and serializing logically decoded WAL records, stitching them with application context, etc.</li><li><strong>Scalability</strong>. Similarly to the callback-based solutions, all audit records are stored in a single table. At scale, this table can become difficult to query and costly to manage.</li></ul><p><em>Full disclosure: I&#x2019;m one of the Bemi core contributors. 
Check out our </em><a href="https://bemi.io/?ref=blog.bemi.io" rel="noopener"><em>Bemi.io</em></a><em> cloud platform if you want to enable an automatic audit trail without the need to manage the infrastructure and deal with scalability issues yourself.</em></p><h2 id="manual-tracking">Manual Tracking</h2><figure class="kg-card kg-image-card"><img src="https://cdn-images-1.medium.com/max/1600/1*RA3qzROOSLAPkbI9ILxB6g.png" class="kg-image" alt="Choosing the Right Audit Trail Approach in&#xA0;Ruby" loading="lazy" width="2000" height="720"></figure><p><a href="https://github.com/public-activity/public_activity?ref=blog.bemi.io" rel="noopener"><strong>PublicActivity</strong></a><strong> </strong>is a gem similar to callback-based solutions that track data changes. Its main difference is that it also allows creating custom activity events for database records that can be serialized and translated with <a href="https://guides.rubyonrails.org/i18n.html?ref=blog.bemi.io" rel="noopener">Rails i18n</a>.</p><pre><code class="language-ruby">my_record.create_activity(
  key: &apos;my_model.commented_on&apos;,
  owner: current_user
)</code></pre><p><a href="https://github.com/ankane/ahoy?ref=blog.bemi.io" rel="noopener"><strong>Ahoy</strong></a> allows tracking and collecting analytics data in a Ruby on Rails application. It is similar to, for example, automatic page visit tracking in Google Analytics. But it can also record custom events in controllers.</p><pre><code class="language-ruby">def update
  ahoy.track(&apos;Updated&apos;, endpoint: &quot;#{request.method} #{request.path}&quot;)
  MyModel.find(params[:id]).update!(published: true)
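  # ahoy.track above records an Ahoy::Event row (name plus properties) tied
  # to the current visit, independently of the model update it annotates.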
end</code></pre><h3 id="pros-3">Pros</h3><ul><li><strong>Versatility</strong>. Creating custom audit trail records manually can be useful when it is necessary to record activities that didn&#x2019;t change data or didn&#x2019;t map 1-to-1 to records&#x2019; data changes.</li></ul><h3 id="cons-3">Cons</h3><ul><li><strong>Cumbersomeness</strong>. These solutions require manually triggering all actions that need to be recorded and updating the codebase in many places. This can be time-consuming and can increase code complexity. It is also easy to forget to trigger the right action, which may lead to an incomplete audit log.</li><li><strong>Flexibility</strong>. While these solutions give some manual control, they are limited either to controller actions or ActiveRecord models. This may not be flexible enough to record all system activities. For example, recording that a background processing job made changes via an API request in an external system, such as a payment processing service.</li></ul><h2 id="console-command-logging">Console Command&#xA0;Logging</h2><figure class="kg-card kg-image-card"><img src="https://cdn-images-1.medium.com/max/1600/1*SSKjFWjna5r30v567xadKA.png" class="kg-image" alt="Choosing the Right Audit Trail Approach in&#xA0;Ruby" loading="lazy" width="2000" height="720"></figure><p><a href="https://github.com/basecamp/console1984?ref=blog.bemi.io" rel="noopener"><strong>Console1984</strong></a> forces developers to specify a reason when they load a Rails console and record it with all executed console commands by storing them in a database.</p><pre><code class="language-bash">$ rails c

Bob, why are you using this console today?
&gt; Migrating customer data, see ticket #781923

&gt; user = User.find(...)
...</code></pre><p>In a Rails console, it breaks down a session into two access modes. One is the regular &#x201C;protected&#x201D; mode available after specifying a Rails console access reason. The other is the &#x201C;sensitive&#x201D; mode, which requires additional explicit consent when accessing sensitive information, such as executing a method that decrypts sensitive data or making external HTTP requests.</p><p>It also comes with a web UI via the <a href="https://github.com/basecamp/audits1984?ref=blog.bemi.io" rel="noopener"><strong>Audits1984</strong></a> gem, which allows reviewing console sessions, approving or flagging them, and leaving comments.</p><h3 id="pros-4">Pros</h3><ul><li><strong>Auditable console sessions</strong>. Commands executed manually by developers can be logged automatically and reviewed later.</li></ul><h3 id="cons-4">Cons</h3><ul><li><strong>Loose control</strong>. In Ruby, it is very easy to modify any class and method definitions dynamically. This means that someone with access to a Rails console can find workarounds and execute some commands that won&#x2019;t be logged. To make logging more reliable and improve internal controls, teams may want to disable production console access and build workflows for running only pre-approved scripts.</li></ul><h2 id="custom-logging">Custom Logging</h2><figure class="kg-card kg-image-card"><img src="https://cdn-images-1.medium.com/max/1600/1*4-0FLfRJN5prxrowiHMnQA.png" class="kg-image" alt="Choosing the Right Audit Trail Approach in&#xA0;Ruby" loading="lazy" width="2000" height="720"></figure><p>Ruby on Rails logging functionality allows logging anything in any text format.</p><pre><code class="language-ruby">payload = {
  user_id: current_user.id,
  endpoint: &quot;#{request.method} #{request.path}&quot;,
}

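# If the Rails logger is wrapped in ActiveSupport::TaggedLogging (the default
# in generated apps), tags can add extra structure to these entries, e.g.:
#   Rails.logger.tagged(&quot;AUDIT&quot;) { Rails.logger.info(payload.to_json) }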
Rails.logger.info(&quot;CUSTOM_LOG_MESSAGE: #{payload.to_json}&quot;)</code></pre><p>Starting with Ruby on Rails 7 (previously via the <a href="https://github.com/basecamp/marginalia?ref=blog.bemi.io" rel="noopener"><strong>Marginalia</strong></a> gem), it is also possible to pass custom application context to ActiveRecord logs via <code>ActiveSupport::CurrentAttributes</code>.</p><pre><code class="language-ruby">Current.user = current_user
Current.endpoint = &quot;#{request.method} #{request.path}&quot;

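# Assumes a Current class is defined, e.g.:
#   class Current &lt; ActiveSupport::CurrentAttributes
#     attribute :user, :endpoint
#   end
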
config.active_record.query_log_tags = [
  {
    user_id: -&gt; { Current.user.id },
    endpoint: -&gt; { Current.endpoint },
  },
]
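
# Note: query log tags must also be turned on for the comments to be appended:
# config.active_record.query_log_tags_enabled = true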

MyRecord.all
# MyRecord Load (0.3ms)  SELECT `my_records`.* FROM `my_records`
# /*user_id:1,endpoint:POST /my_records*/
</code></pre><h3 id="pros-5">Pros</h3><ul><li><strong>Flexibility</strong>. These are the most flexible solutions that allow recording activities in a custom format in text logs.</li></ul><h3 id="cons-5">Cons</h3><ul><li><strong>Consumption</strong>. Collecting unstructured text logs across all application instances and consuming them might be challenging. Depending on the use case, you may need to clean up the logs, parse them, and save them in a data store in a more structured format that allows quick lookups with filters. You may also need to aggregate the logs by transaction, API request, etc. For example, if one log entry says that a record was created by a user, there could be another entry that says that this record creation was not committed and was rolled back.</li></ul><hr><h2 id="conclusion">Conclusion</h2><p>Many Ruby gems are available to help with building an audit trail. As a rule of thumb, you can choose the right tool or a combination depending on your needs:</p><ul><li><strong>Basic change tracking</strong>: if you need it for troubleshooting purposes, you can use <a href="https://github.com/collectiveidea/audited?ref=blog.bemi.io" rel="noopener">Audited</a>, which can also automatically delete old audit records, keeping only the last N changes.</li><li><strong>Change diffing and rollbacks</strong>: for basic change tracking with additional features for querying them, <a href="https://github.com/paper-trail-gem/paper_trail?ref=blog.bemi.io" rel="noopener">PaperTrail</a> is your best choice.</li><li><strong>MongoDB and Mongoid</strong>: if you use the Mongoid object-document mapper, then go with <a href="https://github.com/mongoid/mongoid-history?ref=blog.bemi.io" rel="noopener">Mongoid History</a>.</li><li><strong>Performance over deletion tracking and simplicity</strong>: if you use PostgreSQL, then you can choose <a
href="https://github.com/palkan/logidze?ref=blog.bemi.io" rel="noopener">Logidze</a>.</li><li><strong>Reliability with zero performance overhead</strong>: if you use PostgreSQL and need complete change tracking accuracy and reliability without runtime overhead, go with <a href="https://github.com/BemiHQ/bemi-rails?ref=blog.bemi.io" rel="noopener">Bemi Rails</a>.</li><li><strong>Simple activity feed UIs</strong>: if you need to build a simple activity feed constructed around your records, then go with <a href="https://github.com/public-activity/public_activity?ref=blog.bemi.io" rel="noopener">PublicActivity</a>, which also supports i18n for multi-language interfaces.</li><li><strong>HTTP request tracking</strong>: if you need to track HTTP requests in a structured format, then <a href="https://github.com/ankane/ahoy?ref=blog.bemi.io" rel="noopener">Ahoy</a> is your best choice.</li><li><strong>Console session auditing</strong>: if you need to log and audit the commands executed in Rails consoles, go with <a href="https://github.com/basecamp/console1984?ref=blog.bemi.io" rel="noopener">Console1984</a> and <a href="https://github.com/basecamp/audits1984?ref=blog.bemi.io" rel="noopener">Audits1984</a>.</li><li><strong>Troubleshooting recent issues</strong>: you can use application logs to troubleshoot issues and use <a href="https://github.com/basecamp/marginalia?ref=blog.bemi.io" rel="noopener">Marginalia</a> to automatically annotate log entries to add more context.</li></ul><hr><figure class="kg-card kg-image-card"><img src="https://blog.bemi.io/content/images/2024/07/Untitled.jpg" class="kg-image" alt="Choosing the Right Audit Trail Approach in&#xA0;Ruby" loading="lazy" width="2000" height="1047" srcset="https://blog.bemi.io/content/images/size/w600/2024/07/Untitled.jpg 600w, https://blog.bemi.io/content/images/size/w1000/2024/07/Untitled.jpg 1000w, https://blog.bemi.io/content/images/size/w1600/2024/07/Untitled.jpg 1600w, 
https://blog.bemi.io/content/images/2024/07/Untitled.jpg 2400w" sizes="(min-width: 720px) 720px"></figure><p><em>Also check out the new </em><a href="https://topenddevs.com/podcasts/ruby-rogues/episodes/navigating-sql-data-changes-tools-and-techniques-for-data-recovery-ruby-645?ref=blog.bemi.io" rel="noreferrer"><em>Ruby Rogues podcast episode</em></a><em> where we talk about tools, patterns, and techniques&#xA0;for data recovery in more detail.</em></p>]]></content:encoded></item><item><title><![CDATA[How Change Data Capture Powers Modern Apps]]></title><description><![CDATA[CDC is becoming an increasingly popular software pattern, with dev tooling startups centered around CDC having cumulatively raised nearly a billion dollars in funding in recent years. The surge in CDC's popularity begs the questions: why has it become so important and how does it work?]]></description><link>https://blog.bemi.io/cdc/</link><guid isPermaLink="false">66160245398ea80001a2e26f</guid><category><![CDATA[Engineering]]></category><dc:creator><![CDATA[Arjun Lall]]></dc:creator><pubDate>Thu, 11 Apr 2024 20:29:16 GMT</pubDate><media:content url="https://blog.bemi.io/content/images/2024/04/Group-15800.png" medium="image"/><content:encoded><![CDATA[<img src="https://blog.bemi.io/content/images/2024/04/Group-15800.png" alt="How Change Data Capture Powers Modern Apps"><p>At its core, Change Data Capture (CDC) is a method used to track insert, update, and delete operations made to a database.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.bemi.io/content/images/2024/04/661606e3ebeba037394102-1.gif" class="kg-image" alt="How Change Data Capture Powers Modern Apps" loading="lazy" width="1199" height="274" srcset="https://blog.bemi.io/content/images/size/w600/2024/04/661606e3ebeba037394102-1.gif 600w, https://blog.bemi.io/content/images/size/w1000/2024/04/661606e3ebeba037394102-1.gif 1000w, 
https://blog.bemi.io/content/images/2024/04/661606e3ebeba037394102-1.gif 1199w" sizes="(min-width: 720px) 720px"><figcaption><a href="https://blog.bytebytego.com/p/ep92-top-5-kafka-use-cases?ref=blog.bemi.io"><span style="white-space: pre-wrap;">https://blog.bytebytego.com/p/ep92-top-5-kafka-use-cases</span></a></figcaption></figure><p>CDC is becoming an increasingly popular software pattern, with dev tooling startups centered around CDC such as <a href="https://airbyte.com/?ref=blog.bemi.io" rel="noopener noreferrer">Airbyte</a> and <a href="https://www.fivetran.com/?ref=blog.bemi.io" rel="noopener noreferrer">Fivetran</a> having cumulatively raised nearly a billion dollars in funding in recent years. The surge in CDC&apos;s popularity raises the questions: why has it become so important to today&#x2019;s developers, and how does it work?</p><h2 id="why-now">Why now?</h2><p>CDC isn&#x2019;t exactly new, but its surge in popularity can be attributed to a few key reasons.</p><h3 id="data-fragmentation-and-growth"><strong>Data fragmentation and growth</strong></h3><p>Not only is the amount of data that applications now generate exploding, but the data is increasingly fragmented between various isolated databases, making it a nightmare to keep everything in sync.
CDC captures changes across independent data sources, allowing you to unify your data and ensure everyone has the correct info.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.bemi.io/content/images/2024/04/image.png" class="kg-image" alt="How Change Data Capture Powers Modern Apps" loading="lazy" width="1432" height="884" srcset="https://blog.bemi.io/content/images/size/w600/2024/04/image.png 600w, https://blog.bemi.io/content/images/size/w1000/2024/04/image.png 1000w, https://blog.bemi.io/content/images/2024/04/image.png 1432w" sizes="(min-width: 720px) 720px"><figcaption><a href="https://www.statista.com/statistics/871513/worldwide-data-created/?ref=blog.bemi.io"><span style="white-space: pre-wrap;">Exponential growth of data volumes</span></a></figcaption></figure><h3 id="real-time-demands"><strong>Real-time demands</strong></h3><p>Application data typically flows downstream to a data warehouse periodically on a schedule to be processed for analytics. Today&apos;s applications need to react to data changes as they happen, not wait for batch updates. For example, to be able to make faster decisions by not having stale data on dashboards. Since CDC lets you react to changes as they happen, it enables real-time analytics and event-driven architectures.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.scylladb.com/wp-content/uploads/Event-Driven-Architecture-diagram.png" class="kg-image" alt="How Change Data Capture Powers Modern Apps" loading="lazy"><figcaption><a href="https://www.scylladb.com/glossary/event-driven-architecture/?ref=blog.bemi.io"><span style="white-space: pre-wrap;">https://www.scylladb.com/glossary/event-driven-architecture/</span></a></figcaption></figure><h3 id="ease-of-adoption"><strong>Ease of adoption</strong></h3><p>The CDC infrastructure ecosystem has matured to a point where it&apos;s now practical for companies at all stages. 
Open-source projects like <a href="https://github.com/debezium/debezium?ref=blog.bemi.io">Debezium</a> and <a href="https://github.com/apache/kafka?ref=blog.bemi.io">Kafka</a> have made it easier to build systems that continuously react to data changes. These tools provide the robustness, scalability, and performance needed to process and distribute large volumes of change data in real-time. As CDC continues to get more approachable, it&apos;s igniting a surge in demand, creating a powerful feedback loop that&apos;s leading to even more tooling development efforts.</p><h2 id="how-does-cdc-work">How does CDC work?</h2><p>There are three main types of CDC implementations:</p><ol><li>Log-based captures changes from the existing database transaction logs and is the newest approach.</li><li>Query-based periodically queries the database to identify changes. This approach is simpler to set up but won&apos;t capture delete operations.</li><li>Trigger-based relies on database triggers to capture changes and write them to a change table. This approach reduces database performance since it requires multiple writes on each data change.</li></ol><p>Log-based CDC is quickly becoming embraced as the de facto approach because it&apos;s the least invasive and most efficient. 
It involves a few steps:</p><ol><li><strong>Log creation</strong>: When a change is made to the database, a log entry is created that captures the details of the change.</li><li><strong>Log consumption</strong>: The change data is processed and made available for use.</li><li><strong>Data distribution</strong>: The data is distributed to the desired systems, such as a data warehouse, cache, or search index.</li></ol><p>Let&#x2019;s take a closer look at each step.</p><h3 id="log-creation"><strong>Log creation</strong></h3><p>Before a database such as PostgreSQL, MySQL, MongoDB, and SQLite stores data to disk, it first writes it to a transaction log.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.bemi.io/content/images/2024/04/image-1.png" class="kg-image" alt="How Change Data Capture Powers Modern Apps" loading="lazy" width="2000" height="256" srcset="https://blog.bemi.io/content/images/size/w600/2024/04/image-1.png 600w, https://blog.bemi.io/content/images/size/w1000/2024/04/image-1.png 1000w, https://blog.bemi.io/content/images/size/w1600/2024/04/image-1.png 1600w, https://blog.bemi.io/content/images/2024/04/image-1.png 2000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Database transaction logs creation</span></figcaption></figure><p>This write-ahead logging technique allows writes to be more performant since the database can just do the lightweight log append operation before asynchronously making changes to the actual files and indexes. These transaction logs primarily serve as the database&apos;s source of truth to fall back on in case of a failure.</p><p>In Postgres, the volume of information recorded in the Write-Ahead Logs (WAL) can be adjusted. The <code>wal_level</code> setting offers three options, in ascending order of information logged: <code>minimal</code>, <code>replica</code>, and <code>logical</code>. 
CDC leverages these existing logs as the source of truth of all data changes, but requires a <code>logical</code> setting that enables changes to be read row-by-row, instead of by the physical disk blocks.</p><figure class="kg-card kg-code-card"><pre><code class="language-sql">SHOW wal_level;
+-------------+
| wal_level   |
|-------------|
| logical     |
+-------------+
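
-- If it returns &apos;replica&apos; or &apos;minimal&apos; instead, logical decoding
-- can be enabled with the following command plus a server restart:
-- ALTER SYSTEM SET wal_level = logical;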
</code></pre><figcaption><p><span style="white-space: pre-wrap;">SQL command to check PostgreSQL&apos;s wal_level</span></p></figcaption></figure><p>The format and structure of the transaction logs depend on the implementation of the database type. For instance, MySQL generates a binlog, while MongoDB uses oplogs.</p><h3 id="log-consumption"><strong>Log consumption</strong></h3><p>Fortunately, open-source projects like <a href="https://github.com/debezium/debezium?ref=blog.bemi.io">Debezium</a> can now do most of the hard work of consuming entries from the transaction log and abstracting away the database implementation details with connectors that just produce generic abstract events.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0845002f-8497-4e71-af9e-b711074a6dfe_1600x999.png" class="kg-image" alt="How Change Data Capture Powers Modern Apps" loading="lazy"><figcaption><a href="https://blog.bytebytego.com/p/reddits-architecture-the-evolutionary?ref=blog.bemi.io"><span style="white-space: pre-wrap;">https://blog.bytebytego.com/p/reddits-architecture-the-evolutionary</span></a></figcaption></figure><p>The Postgres connector relies on <a href="https://www.postgresql.org/docs/current/static/protocol-replication.html?ref=blog.bemi.io">PostgreSQL&#x2019;s replication protocol</a> to access changes in real-time from the server&#x2019;s transaction logs. It then transforms this information into a specific format, such as Protobuf or JSON, and sends it to an output destination. Each event gets structured as a key/value pair, where the key represents the primary key of the table, and the value includes details such as the before and after states of the change, along with additional metadata.</p><figure class="kg-card kg-code-card"><pre><code class="language-jsx">{
  &quot;schema&quot;: { ... },
  &quot;payload&quot;: {
    &quot;before&quot;: {
      &quot;id&quot;: 1,
      &quot;first_name&quot;: &quot;Mary&quot;,
      &quot;last_name&quot;: &quot;Samsonite&quot;
    },
    &quot;after&quot;: {
      &quot;id&quot;: 1,
      &quot;first_name&quot;: &quot;Mary&quot;,
      &quot;last_name&quot;: &quot;Swanson&quot;
    }
  },
  &quot;source&quot;: {
    &quot;connector&quot;: &quot;postgresql&quot;,
    &quot;name&quot;: &quot;server1&quot;,
    &quot;ts_ms&quot;: 1559033904863,
    &quot;snapshot&quot;: true,
    &quot;db&quot;: &quot;postgres&quot;,
    &quot;sequence&quot;: &quot;[\\&quot;24023119\\&quot;,\\&quot;24023128\\&quot;]&quot;,
    &quot;schema&quot;: &quot;public&quot;,
    &quot;table&quot;: &quot;customers&quot;,
    &quot;txId&quot;: 555,
    &quot;lsn&quot;: 24023128,
    &quot;xmin&quot;: null
  },
  &quot;op&quot;: &quot;c&quot;,
  &quot;ts_ms&quot;: 1559033904863
}
</code></pre><figcaption><p><span style="white-space: pre-wrap;">Example Update event</span></p></figcaption></figure><h3 id="data-distribution"><strong>Data distribution</strong></h3><p>CDC systems typically incorporate a message broker component to propagate the <a href="https://github.com/debezium/debezium?ref=blog.bemi.io">Debezium</a> events. <a href="https://github.com/apache/kafka?ref=blog.bemi.io">Apache Kafka</a> stands out for this purpose because of a few advantages: scalability to handle large volumes of data, persistence of messages, guaranteed ordering per partition, and compaction capability, where multiple changes on the same record can optionally be easily rolled into one. From the message queue, client applications can then read events that correspond to the database tables of interest, and react to every row-level event they receive.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.bemi.io/content/images/2024/04/image-5.png" class="kg-image" alt="How Change Data Capture Powers Modern Apps" loading="lazy" width="2000" height="420" srcset="https://blog.bemi.io/content/images/size/w600/2024/04/image-5.png 600w, https://blog.bemi.io/content/images/size/w1000/2024/04/image-5.png 1000w, https://blog.bemi.io/content/images/size/w1600/2024/04/image-5.png 1600w, https://blog.bemi.io/content/images/2024/04/image-5.png 2399w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">CDC distributed message queue system</span></figcaption></figure><h2 id="patterns">Patterns</h2><p>There are countless use cases where CDC systems are invaluable. You can use them to build notification systems instead of relying on callbacks, to invalidate caches, to update search indexes, to migrate data without downtime, to update vector embeddings, or to perform point-in-time data recovery, to name a few. 
I&#x2019;ll highlight below some common CDC system patterns I&apos;ve personally seen in production environments.</p><h3 id="microservice-synchronization">Microservice synchronization</h3><p>In a microservice based architecture, each service often maintains its own standalone database. For instance, a user service might handle user data, while a friends service manages friend-related information. You might want to combine the data into a materialized view or replicate it to Elasticsearch to power queries such as &#x201C;give me a user named Mary who has 2 friends&#x201D;. CDC facilitates the decoupling of systems by enabling real-time data sharing across different components without direct message passing, thus supporting the scalability and flexibility required by these architectures.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.bemi.io/content/images/2024/04/image-6.png" class="kg-image" alt="How Change Data Capture Powers Modern Apps" loading="lazy" width="1305" height="730" srcset="https://blog.bemi.io/content/images/size/w600/2024/04/image-6.png 600w, https://blog.bemi.io/content/images/size/w1000/2024/04/image-6.png 1000w, https://blog.bemi.io/content/images/2024/04/image-6.png 1305w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Optimized decoupled local views</span></figcaption></figure><h3 id="audit-trails">Audit Trails</h3><p>CDC offers the most reliable and performant approach for building robust audit trails. The low-level data change events can be stitched with additional metadata to better record who made the change and why it was made. I&apos;m one of the contributors to <a href="https://github.com/BemiHQ/bemi?ref=blog.bemi.io" rel="noreferrer">Bemi</a>, an open-source tool that provides automatic audit trails, and we did this by creating libraries that inserted additional custom application-specific context (userID, API endpoint, etc.) 
in the database transaction logs using a similar technique to <a href="https://google.github.io/sqlcommenter/?ref=blog.bemi.io">Google&apos;s Sqlcommenter</a>. We stitch this information together in a CDC system and then store the enriched data in a queryable database. </p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.bemi.io/content/images/2024/04/image-7.png" class="kg-image" alt="How Change Data Capture Powers Modern Apps" loading="lazy" width="2000" height="349" srcset="https://blog.bemi.io/content/images/size/w600/2024/04/image-7.png 600w, https://blog.bemi.io/content/images/size/w1000/2024/04/image-7.png 1000w, https://blog.bemi.io/content/images/size/w1600/2024/04/image-7.png 1600w, https://blog.bemi.io/content/images/2024/04/image-7.png 2327w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Audit trails CDC architecture</span></figcaption></figure><h2 id="conclusion">Conclusion</h2><p>As demand for CDC grows, understanding it is becoming increasingly essential for today&apos;s developers. And as developer tooling in this space continues to improve, the countless use cases powered by CDC will continue to get more accessible.</p><p>I&apos;ve intentionally glossed over a lot of CDC details in this blog to keep it short. But I&apos;d recommend checking out the <a href="https://github.com/BemiHQ/bemi?ref=blog.bemi.io">Bemi source code</a> to see how CDC systems that have handled billions of data changes actually work under the hood!</p>]]></content:encoded></item><item><title><![CDATA[The Day Soft Deletes Caused Chaos]]></title><description><![CDATA[Discover the critical mistakes and lessons learned from using soft deletes in production systems. 
This blog post explores the complexities, data integrity issues, and alternative solutions to managing deleted data effectively.]]></description><link>https://blog.bemi.io/soft-deleting-chaos/</link><guid isPermaLink="false">65ea0504398ea80001a2def7</guid><category><![CDATA[Engineering]]></category><dc:creator><![CDATA[Arjun Lall]]></dc:creator><pubDate>Tue, 12 Mar 2024 17:48:15 GMT</pubDate><media:content url="https://blog.bemi.io/content/images/2024/03/Group-15733--2---2---2-.png" medium="image"/><content:encoded><![CDATA[<img src="https://blog.bemi.io/content/images/2024/03/Group-15733--2---2---2-.png" alt="The Day Soft Deletes Caused Chaos"><p>The worst mistake I made in my software engineering career was merging a seemingly harmless pull request 5 years ago.</p><p>TLDR: Soft deletes <strong>should not</strong> be used in production-grade systems&#x2014;a lesson I learned the hard way when a severe mishap enabled the sale of the same concert seats to unlimited buyers.</p><p>Soft deletion is the easiest way to store deleted data and means just setting a <code>deleted</code> flag instead of performing a <code>DELETE</code> operation directly.</p><figure class="kg-card kg-code-card"><pre><code class="language-sql">+--------------------------------------+------------+------------+---------+
| id                                   | last_name  | first_name | deleted |
+--------------------------------------+------------+------------+---------+
| 12778d88-41c8-4fc2-8be6-68c5d51c3893 | Samsonite  | Mary       | true    | 
+--------------------------------------+------------+------------+---------+</code></pre><figcaption><p><span style="white-space: pre-wrap;">Deleted user using soft deletion</span></p></figcaption></figure><p>You add a new column on a table, perform an update when deleting, and filter out deleted data when querying.</p><figure class="kg-card kg-code-card"><pre><code class="language-sql">SELECT *
FROM user
WHERE id = $1
    AND deleted IS NULL;</code></pre><figcaption><p><span style="white-space: pre-wrap;">Querying with soft deletion</span></p></figcaption></figure><p>Although this approach is simple to set up, it can lead to dangerous consequences if it accidentally returns data not meant to be seen.</p><h2 id="critical-incident">Critical Incident</h2><p>I was working at an events ticketing company and I created a pull request that was similar to this:</p><figure class="kg-card kg-code-card"><pre><code class="language-Ruby">class SeatClaim 
... 
- acts_as_paranoid
... 
+ def remove 
+   move_to_expired
+   destroy
+ end
...
end
</code></pre><figcaption><p><span style="white-space: pre-wrap;">app/models/seat_claim.rb</span></p></figcaption></figure><figure class="kg-card kg-code-card"><pre><code class="language-Ruby">+ class MigrateDeletedSeatClaims &lt; Migration
+  def self.up
+    expired_seat_claims = SeatClaim.where.not(deleted_at: nil)
+    expired_seat_claims.each(&amp;:remove)
+  end 
+
+  def self.down 
... 
+ end </code></pre><figcaption><p><span style="white-space: pre-wrap;">migrations/19700101000000_migrate_deleted_seat_claims.rb</span></p></figcaption></figure><p>In the seating reservation experience, you could claim a seat for 5 minutes during the checkout flow before a background job would delete the claim and release the seat to be bookable again. I was migrating from soft deleting seat claim&#x2019;s to a new collection explicitly meant for storing the deleted rows.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.bemi.io/content/images/2024/03/image-1.png" class="kg-image" alt="The Day Soft Deletes Caused Chaos" loading="lazy" width="1486" height="1020" srcset="https://blog.bemi.io/content/images/size/w600/2024/03/image-1.png 600w, https://blog.bemi.io/content/images/size/w1000/2024/03/image-1.png 1000w, https://blog.bemi.io/content/images/2024/03/image-1.png 1486w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Seating Reservation UX</span></figcaption></figure><p>The incident was caused because of this line:</p><pre><code class="language-ruby">- acts_as_paranoid
</code></pre><p>This removed the <a href="https://github.com/rubysherpas/paranoia?ref=blog.bemi.io">Paranoia</a> library on the model that had abstracted away the soft deletion logic i.e. setting a <code>deleted_at</code> field to the current time when you&#xA0;delete a record. What wasn&apos;t top of mind for me was that it also automatically filtered out all soft-deleted records in the ORM. </p><p>Without the automatic exclusion of soft-deleted records and while the migration hadn&apos;t finished, the background worker began collecting claims that had already been &quot;deleted&quot; - inadvertently causing seats that were successfully paid for to be released and available for booking again!</p><p>I&#x2019;ll never forget the sinking feeling and sense of dread when I realized what was happening.</p><p>This meant that the same seat at a Shawn Mendes concert was being sold multiple times over. Amplified by lots of seats, amplified by lots of events around the globe! Yeah it was bad.</p><p>To be fair, soft deletes weren&apos;t the lone culprit and there was a lot I should have done differently for this change, like breaking it out into multiple steps. Automated tests in the CI/CD pipeline should have caught this error, but it managed to slip through the cracks. Luckily there was a lot of observability in this area, so it was detected and remediated almost immediately. But the impact and fallout was still severe with hundreds of double bookings that had to be refunded, orders cancelled, apology emails sent to affected customers, and a late night postmortem written. </p><h2 id="don%E2%80%99t-soft-delete">Don&#x2019;t Soft Delete</h2><p>The instinct to retain deleted data is understandable, even within the regulatory landscape of <a href="https://en.wikipedia.org/wiki/General_Data_Protection_Regulation?ref=blog.bemi.io">GDPR</a>. 
Developers may need it for compliance, reporting, analytics, or just as a safety net &#x2013; a chance to recover from accidental deletions or to examine a deleted record for troubleshooting. Imagine a customer accidentally deleting a crucial invoice, or a social media user deleting a comment that broke the rules. Keeping deleted data for a grace period can be valuable. However, the soft deletes approach creates more problems than it solves.</p><h3 id="complexity">Complexity</h3><p>Soft deletion infects everything and complicates queries. The application ORM layer usually automatically filters out &quot;deleted&quot; records, but this convenience can lead to oversight when constructing complex SQL queries manually. Like me, you might end up retrieving inaccurate results, potentially exposing sensitive data or making bad decisions based on incomplete information. Yes, creating a database <a href="https://en.wikipedia.org/wiki/View_(SQL)?ref=blog.bemi.io">View</a> is safer, but it&#x2019;s still extra complexity and an unneeded appendage. </p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://pbs.twimg.com/media/E5r2BEfVgAIs8EW.jpg:large" class="kg-image" alt="The Day Soft Deletes Caused Chaos" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Murphy&apos;s Law: anything that can go wrong will go wrong</span></figcaption></figure><p>Indexes, unique constraints, and foreign key relationships all also need to consider the &quot;deleted&quot; state, making them more intricate to create and maintain. </p><figure class="kg-card kg-code-card"><pre><code class="language-sql">CREATE UNIQUE INDEX unique_active_users_email ON users (email)
WHERE deleted_at IS NULL;
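
-- Note: foreign key constraints cannot be scoped with a WHERE clause the
-- same way, so references to soft-deleted rows must be policed by the
-- application instead of the database.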
</code></pre><figcaption><p><span style="white-space: pre-wrap;">Creating a unique index on the email field for active users</span></p></figcaption></figure><p>Even with adding <a href="https://en.wikipedia.org/wiki/Partial_index?ref=blog.bemi.io">partial indexes</a>, soft deletes can lead to significant bloat, adversely affecting table size and performance. In high-volume environments, this can become a bigger issue and require performance tuning or data partitioning to maintain efficiency.</p><h3 id="data-integrity">Data Integrity</h3><p>Handling deletion in the application layer via soft deletes loses one of the benefits of the database, which tries to keep data valid for you. </p><figure class="kg-card kg-code-card"><pre><code class="language-sql">ERROR: delete on table &quot;users&quot; violates foreign key constraint &quot;orders_user_id_fkey&quot; on table &quot;orders&quot;
DETAIL: Key (id)=(456) is still referenced from table &quot;orders&quot;.
</code></pre><figcaption><p><span style="white-space: pre-wrap;">Database foreign key violation error</span></p></figcaption></figure><p>Enforcing referential integrity on your own can be error-prone and adds significant development and maintenance overhead.</p><h2 id="alternatives-to-soft-deletes">Alternatives to Soft Deletes</h2><p>An alternative to soft deletes is to archive deleted data into history tables. It&apos;s still simple to do and removes the long-term liability and maintenance burden of soft deletes. This can be done by inserting the deleted record into a separate table before deleting it.</p><figure class="kg-card kg-code-card"><pre><code class="language-sql">BEGIN;

-- Insert the SeatClaim record into SeatClaimHistory with deletion details
INSERT INTO SeatClaimHistory (id, user_id, seat_id, deleted_at)
SELECT id, user_id, seat_id, NOW() FROM SeatClaim WHERE id = $1;

-- Delete the original SeatClaim record
DELETE FROM SeatClaim WHERE id = $1;

COMMIT;</code></pre><figcaption><p><span style="white-space: pre-wrap;">Transaction to archive data before deleting</span></p></figcaption></figure><p>If you don&apos;t want to manually archive data all over your codebase, the best alternative is building an audit trail at the database layer. The&#xA0;<a href="https://blog.bemi.io/the-ultimate-guide-to-postgresql-data-change-tracking/" rel="noopener noreferrer">Ultimate Guide to PostgreSQL Data Change Tracking</a> outlines the different strategies for PostgreSQL. I&apos;d also recommend checking out an open-source project I contribute to called <a href="https://github.com/BemiHQ/bemi?ref=blog.bemi.io" rel="noopener">Bemi</a>,&#xA0;which aims to simplify this by plugging into a database and application (support for lots of different ORMs e.g. <a href="https://github.com/BemiHQ/bemi-rails?ref=blog.bemi.io" rel="noreferrer">Bemi-rails</a>) to provide a record of contextualized data changes automatically.</p><h2 id="the-bottom-line">The Bottom Line</h2><p>Steer clear of soft deletes. They might look like the easy fix for managing deleted data, but trust me&#x2014;they&apos;re a ticking time bomb. I learned this the hard way years ago, and it&apos;s a mistake you don&apos;t want to repeat. Opt for history or audit tables instead. It&apos;s cleaner, safer, and will save you a world of trouble down the line. 
</p>]]></content:encoded></item><item><title><![CDATA[The Ultimate Guide to PostgreSQL Data Change Tracking]]></title><description><![CDATA[Explore five methods of data change tracking in PostgreSQL available in 2024.]]></description><link>https://blog.bemi.io/the-ultimate-guide-to-postgresql-data-change-tracking/</link><guid isPermaLink="false">65d97ce4e08447000140ff2e</guid><category><![CDATA[Engineering]]></category><dc:creator><![CDATA[Evgeny Li]]></dc:creator><pubDate>Sat, 24 Feb 2024 05:29:43 GMT</pubDate><media:content url="https://blog.bemi.io/content/images/2024/02/og-image-.png" medium="image"/><content:encoded><![CDATA[<img src="https://blog.bemi.io/content/images/2024/02/og-image-.png" alt="The Ultimate Guide to PostgreSQL Data Change&#xA0;Tracking"><p>PostgreSQL, one of the most popular databases, was named DBMS of the Year 2023 by <a href="https://db-engines.com/en/blog_post/106?ref=blog.bemi.io" rel="noopener">DB-Engines Ranking</a> and is used more than any other database among startups according to <a href="https://www.hntrends.com/2024/january.html?compare=PostgreSQL&amp;compare=MySQL&amp;compare=MongoDB&amp;compare=SQL+Server&amp;ref=blog.bemi.io" rel="noopener">HN Hiring Trends</a>.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn-images-1.medium.com/max/1600/1*d2nVVieI0nqxZJhS2-BFeQ.png" class="kg-image" alt="The Ultimate Guide to PostgreSQL Data Change&#xA0;Tracking" loading="lazy" width="2626" height="822"><figcaption><span style="white-space: pre-wrap;">PostgreSQL is the most popular database among&#xA0;startups</span></figcaption></figure><p>The SQL standard has included features related to <a href="https://en.wikipedia.org/wiki/Temporal_database?ref=blog.bemi.io" rel="noopener">temporal databases</a> since 2011, which allow storing data changes over time rather than just the current data state. However, relational databases don&#x2019;t completely follow the standards. 
In the case of PostgreSQL, it doesn&#x2019;t support these features, even though there has been a submitted <a href="https://www.postgresql.org/message-id/flat/CALAY4q-cXCD0r4OybD%3Dw7Hr7F026ZUY6%3DLMsVPUe6yw_PJpTKQ%40mail.gmail.com?ref=blog.bemi.io" rel="noopener">patch</a> with some discussions.</p><p>There are PostgreSQL extensions like <a href="https://github.com/xocolatl/periods?ref=blog.bemi.io" rel="noopener">periods</a> and <a href="https://github.com/arkhipov/temporal_tables?ref=blog.bemi.io" rel="noopener">temporal_tables</a> that add support for temporal tables. Unfortunately, cloud providers such as AWS, Azure, and GCP don&#x2019;t allow running custom C extensions with managed databases.</p><p>Let&#x2019;s explore five alternative methods of data change tracking in PostgreSQL available to us in 2024.</p><h2 id="triggers-and-audit-table">Triggers and Audit&#xA0;Table</h2><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn-images-1.medium.com/max/1600/1*z33FB139uoaUOy62JgeS-Q.png" class="kg-image" alt="The Ultimate Guide to PostgreSQL Data Change&#xA0;Tracking" loading="lazy" width="1920" height="1008"><figcaption><span style="white-space: pre-wrap;">A PostgreSQL trigger with an audit&#xA0;table</span></figcaption></figure><p>PostgreSQL allows adding triggers with custom procedural SQL code performed on row changes with <code>INSERT</code>, <code>UPDATE</code>, and <code>DELETE</code> queries. The official PostgreSQL wiki describes a generic <a href="https://wiki.postgresql.org/wiki/Audit_trigger?ref=blog.bemi.io" rel="noopener">audit trigger function</a>. Let&#x2019;s have a quick look at a simplified example.</p><p>First, create a table called <code>logged_actions</code> in a separate schema called <code>audit</code>:</p><pre><code class="language-sql">CREATE schema audit;

CREATE TABLE audit.logged_actions (
  schema_name TEXT NOT NULL,
  table_name TEXT NOT NULL,
  user_name TEXT,
  action_tstamp TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT current_timestamp,
  action TEXT NOT NULL CHECK (action IN (&apos;I&apos;,&apos;D&apos;,&apos;U&apos;)),
  original_data TEXT,
  new_data TEXT,
  query TEXT
);</code></pre><p>Next, create a function to insert audit records and establish a trigger on a table you wish to track, such as <code>my_table</code>:</p><pre><code class="language-sql">CREATE OR REPLACE FUNCTION audit.if_modified_func() RETURNS TRIGGER AS $body$
BEGIN
  IF (TG_OP = &apos;UPDATE&apos;) THEN
    INSERT INTO audit.logged_actions (schema_name,table_name,user_name,action,original_data,new_data,query)
    VALUES (TG_TABLE_SCHEMA::TEXT,TG_TABLE_NAME::TEXT,session_user::TEXT,substring(TG_OP,1,1),ROW(OLD.*),ROW(NEW.*),current_query());
    RETURN NEW;
  elsif (TG_OP = &apos;DELETE&apos;) THEN
    INSERT INTO audit.logged_actions (schema_name,table_name,user_name,action,original_data,query)
    VALUES (TG_TABLE_SCHEMA::TEXT,TG_TABLE_NAME::TEXT,session_user::TEXT,substring(TG_OP,1,1),ROW(OLD.*),current_query());
    RETURN OLD;
  elsif (TG_OP = &apos;INSERT&apos;) THEN
    INSERT INTO audit.logged_actions (schema_name,table_name,user_name,action,new_data,query)
    VALUES (TG_TABLE_SCHEMA::TEXT,TG_TABLE_NAME::TEXT,session_user::TEXT,substring(TG_OP,1,1),ROW(NEW.*),current_query());
    RETURN NEW;
  END IF;
END;
$body$
LANGUAGE plpgsql;
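-- Optional hardening (illustrative, not part of the wiki example): keep the
-- audit table append-only so tracked history cannot be quietly rewritten.
-- "app_user" is a hypothetical application role; this assumes app_user is
-- neither the table owner nor a superuser.
REVOKE UPDATE, DELETE, TRUNCATE ON audit.logged_actions FROM app_user;
GRANT INSERT, SELECT ON audit.logged_actions TO app_user;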

CREATE TRIGGER my_table_if_modified_trigger
AFTER INSERT OR UPDATE OR DELETE ON my_table
FOR EACH ROW EXECUTE PROCEDURE audit.if_modified_func();</code></pre><p>Once it&#x2019;s done, row changes made in <code>my_table</code> will create records in <code>audit.logged_actions</code>:</p><pre><code class="language-sql">INSERT INTO my_table(x,y) VALUES (1, 2);
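-- Illustrative: updates (and deletes) on my_table are captured the same way
UPDATE my_table SET y = 3 WHERE x = 1;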
SELECT * FROM audit.logged_actions;</code></pre><p>If you want to further improve this solution by using JSONB columns instead of TEXT, ignoring changes in certain columns, pausing auditing a table, and so on, check out the SQL example in this <a href="https://github.com/2ndQuadrant/audit-trigger?ref=blog.bemi.io" rel="nofollow noopener noopener noopener">audit-trigger</a> repo and its forks.</p><p>Another alternative is the <a href="https://github.com/nearform/temporal_tables?ref=blog.bemi.io" rel="noopener">temporal_tables</a> implementation written by using triggers. The main difference is that it stores records in a separate table with a time range during which a version was valid, not just an initial timestamp when a change was recorded. This makes it easier to perform time travel queries by selecting records that were valid at a specific point in time.</p><h3 id="downsides">Downsides</h3><ul><li>Performance. Triggers add performance overhead by inserting additional records synchronously on every <code>INSERT</code>, <code>UPDATE</code>, and <code>DELETE</code> operation.</li><li>Security. Anyone with superuser access can modify the triggers and make unnoticed data changes. It is also recommended to make sure that records in the audit table cannot be modified or removed.</li><li>Maintenance. Managing complex triggers across many constantly changing tables can become cumbersome. 
Making a small mistake in an SQL script can break queries or data change tracking functionality.</li></ul><h2 id="triggers-and-notifylisten">Triggers and Notify/Listen</h2><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn-images-1.medium.com/max/1600/1*XWnUtj_RQuBClo2FdFZEbQ.png" class="kg-image" alt="The Ultimate Guide to PostgreSQL Data Change&#xA0;Tracking" loading="lazy" width="2400" height="832"><figcaption><span style="white-space: pre-wrap;">A PostgreSQL trigger with&#xA0;Notify</span></figcaption></figure><p>This approach is similar to the previous one but instead of writing data changes in the audit table directly, we pass them through a pub/sub mechanism through a trigger to another system dedicated to reading and storing these data changes:</p><pre><code class="language-sql">CREATE OR REPLACE FUNCTION if_modified_func() RETURNS TRIGGER AS $body$
BEGIN
  IF (TG_OP = &apos;UPDATE&apos;) THEN
    PERFORM pg_notify(&apos;data_changes&apos;, json_build_object(
      &apos;schema_name&apos;, TG_TABLE_SCHEMA::TEXT,
      &apos;table_name&apos;, TG_TABLE_NAME::TEXT,
      &apos;user_name&apos;, session_user::TEXT,
      &apos;action&apos;, substring(TG_OP,1,1),
      &apos;original_data&apos;, to_jsonb(OLD),
      &apos;new_data&apos;, to_jsonb(NEW)
    )::TEXT);
    RETURN NEW;
  elsif (TG_OP = &apos;DELETE&apos;) THEN
    PERFORM pg_notify(&apos;data_changes&apos;, json_build_object(
      &apos;schema_name&apos;, TG_TABLE_SCHEMA::TEXT,
      &apos;table_name&apos;, TG_TABLE_NAME::TEXT,
      &apos;user_name&apos;, session_user::TEXT,
      &apos;action&apos;, substring(TG_OP,1,1),
      &apos;original_data&apos;, to_jsonb(OLD)
    )::TEXT);
    RETURN OLD;
  elsif (TG_OP = &apos;INSERT&apos;) THEN
    PERFORM pg_notify(&apos;data_changes&apos;, json_build_object(
      &apos;schema_name&apos;, TG_TABLE_SCHEMA::TEXT,
      &apos;table_name&apos;, TG_TABLE_NAME::TEXT,
      &apos;user_name&apos;, session_user::TEXT,
      &apos;action&apos;, substring(TG_OP,1,1),
      &apos;new_data&apos;, to_jsonb(NEW)
    )::TEXT);
    RETURN NEW;
  END IF;
END;
$body$
LANGUAGE plpgsql;

CREATE TRIGGER my_table_if_modified_trigger
AFTER INSERT OR UPDATE OR DELETE ON my_table
FOR EACH ROW EXECUTE PROCEDURE if_modified_func();</code></pre><p>Now it&#x2019;s possible to run a separate process running as a worker that listens to messages containing data changes and stores them separately:</p><pre><code class="language-sql">LISTEN data_changes;</code></pre><h3 id="downsides-1">Downsides</h3><ul><li>&#x201C;At most once&#x201D; delivery<strong>.</strong> Listen/notify notifications are not persisted meaning if a listener disconnects, it may miss updates that happened before it reconnected again.</li><li>Payload size limit. Listen/notify messages have a maximum payload size of 8000 bytes by default. For larger payloads, it is recommended to store them in the DB audit table and send only references of the records.</li><li>Debugging. Troubleshooting issues related to triggers and listen/notify in a production environment can be challenging due to their asynchronous and distributed nature.</li></ul><h2 id="application-level-tracking">Application-Level Tracking</h2><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn-images-1.medium.com/max/1600/1*JfwZ5-Usnhbim50P6BudTg.png" class="kg-image" alt="The Ultimate Guide to PostgreSQL Data Change&#xA0;Tracking" loading="lazy" width="2400" height="976"><figcaption><span style="white-space: pre-wrap;">Application-level tracking with a PostgreSQL audit&#xA0;table</span></figcaption></figure><p>If you have control over the codebase that connects and makes data changes in a PostgreSQL database, then one of the following options is also available to you:</p><ul><li>Manually record all data changes when issuing <code>INSERT</code>, <code>UPDATE</code>, and <code>DELETE</code> queries</li><li>Use existing open-source libraries that integrate with popular ORMs</li></ul><p>For example, there is <a href="https://github.com/paper-trail-gem/paper_trail?ref=blog.bemi.io" rel="noopener">paper_trail</a> for Ruby on Rails with ActiveRecord and <a 
href="https://github.com/jazzband/django-simple-history?ref=blog.bemi.io" rel="noopener">django-simple-history</a> for Django. At a high level, they use callbacks or middlewares to insert additional records into an audit table. Here is a simplified example written in Ruby:</p><pre><code class="language-ruby">class User &lt; ApplicationRecord
  after_commit :track_data_changes

  private

  def track_data_changes
    # "changes" is reset once the record is saved; previous_changes holds the committed diff
    AuditRecord.create!(auditable: self, changes: previous_changes)
  end
end</code></pre><p>On the application level, <a href="https://martinfowler.com/eaaDev/EventSourcing.html?ref=blog.bemi.io" rel="noopener">Event Sourcing</a> can also be implemented with an append-only log as the source of truth. But it&#x2019;s a separate, big, and exciting topic that deserves a separate blog post.</p><h3 id="downsides-2">Downsides</h3><ul><li>Reliability. Application-level data change tracking is not as accurate as database-level change tracking. For example, data changes made outside an app will not be tracked, developers may accidentally skip callbacks, or there could be data inconsistencies if a query changing the data has succeeded but a query inserting an audit record failed.</li><li>Performance. Manually capturing changes and inserting them in the database via callbacks leads to both runtime application and database overhead.</li><li>Scalability. These audit tables are usually stored in the same database and can quickly become unmanageable, which can require separating the storage, implementing declarative partitioning, and continuous archiving.</li></ul><h2 id="change-data-capture">Change Data&#xA0;Capture</h2><p><a href="https://en.wikipedia.org/wiki/Change_data_capture?ref=blog.bemi.io" rel="noopener">Change Data Capture</a> (CDC) is a pattern of identifying and capturing changes made to data in a database and sending those changes to a downstream system. Most often it is used for <a href="https://en.wikipedia.org/wiki/Extract,_transform,_load?ref=blog.bemi.io" rel="noopener">ETL</a> to send data to a data warehouse for analytical purposes.</p><p>There are multiple approaches to implementing CDC. One of them, which doesn&#x2019;t intersect with what we have already discussed, is a log-based CDC. 
With PostgreSQL, it is possible to connect to the <a href="https://www.postgresql.org/docs/current/wal-intro.html?ref=blog.bemi.io" rel="noopener">Write-Ahead Log</a> (WAL) that is used for data durability, recovery, and replication to other instances.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn-images-1.medium.com/max/1600/1*yN-xiW9G5u6Ghrt7_Xv0uQ.png" class="kg-image" alt="The Ultimate Guide to PostgreSQL Data Change&#xA0;Tracking" loading="lazy" width="2412" height="832"><figcaption><span style="white-space: pre-wrap;">CDC with PostgreSQL logical replication</span></figcaption></figure><p>PostgreSQL supports two types of replications: physical replication and logical replication. The latter allows decoding WAL changes on a row level and filtering them out, for example, by table name. This is exactly what we need to implement data change tracking with CDC.</p><p>Here are the basic steps necessary for retrieving data changes by using logical replication:</p><p>1. Set <code>wal_level</code> to <code>logical</code> in <code>postgresql.conf</code> and restart the database.</p><p>2. Create a publication like a &#x201C;pub/sub channel&#x201D; for receiving data changes:</p><pre><code class="language-sql">CREATE PUBLICATION my_publication FOR ALL TABLES;</code></pre><p>3. Create a logical replication slot like a &#x201C;cursor position&#x201D; in the WAL:</p><pre><code class="language-sql">SELECT * FROM pg_create_logical_replication_slot(&apos;my_replication_slot&apos;, &apos;wal2json&apos;);</code></pre><p>4. Fetch the latest unread changes:</p><pre><code class="language-sql">SELECT * FROM pg_logical_slot_get_changes(&apos;my_replication_slot&apos;, NULL, NULL);</code></pre><p>To implement log-based CDC with PostgreSQL, I would recommend using the existing open-source solutions. 
The most popular one is <a href="https://github.com/debezium/debezium?ref=blog.bemi.io" rel="noopener">Debezium</a>.</p><h3 id="downsides-3">Downsides</h3><ul><li>Limited context. PostgreSQL WAL contains only low-level information about row changes and doesn&#x2019;t include information about an SQL query that triggered the change, information about a user, or any application-specific context.</li><li>Complexity. Implementing CDC adds a lot of system complexity. This involves running a server that connects to PostgreSQL as a replica, consumes data changes, and stores them somewhere.</li><li>Tuning. Running it in a production environment may require a deeper understanding of PostgreSQL internals and properly configuring the system. For example, periodically flushing the position for a replication slot to reclaim WAL disk space.</li></ul><h2 id="integrated-change-data-capture">Integrated Change Data&#xA0;Capture</h2><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn-images-1.medium.com/max/1600/1*WYXhO7O5a3bX-EQQtcaJ9Q.png" class="kg-image" alt="The Ultimate Guide to PostgreSQL Data Change&#xA0;Tracking" loading="lazy" width="2400" height="832"><figcaption><span style="white-space: pre-wrap;">Integrated CDC with application context</span></figcaption></figure><p>To overcome the challenge of limited information about data changes stored in the WAL, we can use a clever approach of passing additional context to the WAL directly.</p><p>Here is a simple example of passing additional context on row changes:</p><pre><code class="language-sql">CREATE OR REPLACE FUNCTION if_modified_func() RETURNS TRIGGER AS $body$
BEGIN
  PERFORM pg_logical_emit_message(true, &apos;my_message&apos;, &apos;ADDITIONAL_CONTEXT&apos;);

  IF (TG_OP = &apos;DELETE&apos;) THEN
    RETURN OLD;
  ELSE
    RETURN NEW;
  END IF;
END;
$body$
LANGUAGE plpgsql;
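-- Illustrative check (run after the trigger below is in place, and assuming
-- the &apos;my_replication_slot&apos;/wal2json slot from the CDC section above):
-- peeking at unconsumed WAL changes shows the emitted &apos;my_message&apos;
-- payload interleaved with the row changes, without advancing the slot.
-- SELECT * FROM pg_logical_slot_peek_changes(&apos;my_replication_slot&apos;, NULL, NULL);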

CREATE TRIGGER my_table_if_modified_trigger
AFTER INSERT OR UPDATE OR DELETE ON my_table
FOR EACH ROW EXECUTE PROCEDURE if_modified_func();</code></pre><p>Notice the <code>pg_logical_emit_message</code> function that was added to PostgreSQL as an internal function for plugins. It allows namespacing and emitting messages that will be stored in the WAL. Reading these messages became possible with the standard logical decoding plugin <code>pgoutput</code> since PostgreSQL v14.</p><p>There is an open-source project called <a href="https://github.com/BemiHQ/bemi?ref=blog.bemi.io" rel="noopener">Bemi</a> which allows tracking not only low-level data changes but also reading any custom context with CDC and stitching everything together. Full disclaimer, I&#x2019;m one of the core contributors.</p><p>For example, it can integrate with popular ORMs and adapters to pass application-specific context with all data changes:</p><pre><code class="language-js">import { setContext } from &quot;@bemi-db/prisma&quot;;
import express, { Request } from &quot;express&quot;;

const app = express();

app.use(
  // Customizable context
  setContext((req: Request) =&gt; ({
    userId: req.user?.id,
    endpoint: req.url,
    params: req.body,
  }))
);</code></pre><h3 id="downsides-4">Downsides</h3><ul><li>Complexity and tuning related to implementing CDC.</li></ul><p>If you need a ready-to-use cloud solution that can be integrated and connected to PostgreSQL in a few minutes, check out <a href="https://bemi.io/?ref=blog.bemi.io" rel="noopener">bemi.io</a>.</p><h2 id="conclusion">Conclusion</h2><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cdn-images-1.medium.com/max/1600/1*4zoFllSTxsWqqjI4zP8cWw.png" class="kg-image" alt="The Ultimate Guide to PostgreSQL Data Change&#xA0;Tracking" loading="lazy" width="2405" height="1264"><figcaption><span style="white-space: pre-wrap;">PostgreSQL data change tracking approach comparison</span></figcaption></figure><ol><li>If you need basic data change tracking, <strong>triggers with an audit table</strong> are a great initial solution.</li><li><strong>Triggers with listen/notify</strong> are a good option for simple testing in a development environment.</li><li>If you value application-specific context (information about a user, API endpoint, etc.) over reliability, you can use <strong>application-level tracking</strong>.</li><li><strong>Change Data Capture</strong> is a good option if you prioritize reliability and scalability as a unified solution that can be reused, for example, across many databases.</li><li>Finally, <strong>integrated Change Data Capture </strong>is your best bet if you need a robust data change tracking system that can also be integrated into your application. Go with <a href="https://bemi.io/?ref=blog.bemi.io" rel="noopener">bemi.io</a> if you need a cloud-managed solution.</li></ol>]]></content:encoded></item><item><title><![CDATA[From Black Box to Open Source: Embracing Transparency]]></title><description><![CDATA[Bemi, a platform for real-time data tracking, announces it's open-sourcing its code to build trust, expand functionality, and contribute to the developer community. 
This transparency empowers users, attracts diverse perspectives, and fosters collaboration within the developer ecosystem.]]></description><link>https://blog.bemi.io/from-black-box-to-open-source-embracing-transparency/</link><guid isPermaLink="false">65d97c09e08447000140ff1a</guid><category><![CDATA[Announcement]]></category><dc:creator><![CDATA[Arjun Lall]]></dc:creator><pubDate>Fri, 09 Feb 2024 12:00:00 GMT</pubDate><media:content url="https://blog.bemi.io/content/images/2024/02/_-1.webp" medium="image"/><content:encoded><![CDATA[<img src="https://blog.bemi.io/content/images/2024/02/_-1.webp" alt="From Black Box to Open Source: Embracing Transparency"><p>Today, we&#x2019;re thrilled to announce that we&#x2019;re <a href="https://github.com/BemiHQ/bemi?ref=blog.bemi.io" rel="noreferrer">open-sourcing Bemi</a>! ? This is a fundamentally different approach to company building and we&apos;ll explain why this is the right decision for us.</p><p>For context &#x2014; Bemi is a platform to automatically track contextualized PostgreSQL data changes and allows devs to leverage real-time data in their applications.</p><h2 id="building-trust">Building trust</h2><p>At the heart of our decision is the desire to build trust. We&apos;re committed to eliminating data black boxes by providing a direct line of sight into the inner workings of our platform.</p><p>Users sometimes express concerns about access to specific data. To address this, we can now easily point to the code and affirm that unless a table or column is explicitly specified, it remains unseen. 
This tangible proof assures users that their data is handled as intended and instills confidence in our cloud offering.</p><figure class="kg-card kg-image-card"><img src="https://blog.bemi.io/content/images/2024/02/1-2.webp" class="kg-image" alt="From Black Box to Open Source: Embracing Transparency" loading="lazy" width="2000" height="1050" srcset="https://blog.bemi.io/content/images/size/w600/2024/02/1-2.webp 600w, https://blog.bemi.io/content/images/size/w1000/2024/02/1-2.webp 1000w, https://blog.bemi.io/content/images/size/w1600/2024/02/1-2.webp 1600w, https://blog.bemi.io/content/images/2024/02/1-2.webp 2000w" sizes="(min-width: 720px) 720px"></figure><h2 id="the-long-tail">The long tail</h2><p>Transparency comes with a greater surface area for feedback, making open sourcing a key approach for expanding our software&apos;s functionality. It invites a diverse range of perspectives, ensures compatibility with different systems, and uncovers fresh applications&#x2014;especially important for more generic infrastructure or database software.</p><p>Open source acts as a catalyst, guiding us towards a broader range of capabilities that meet the current diverse needs of developers. 
An example of this is the increasing popularity of PostgreSQL and MySQL over proprietary Oracle and MSSQL database incumbents.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.bemi.io/content/images/2024/02/2-2.webp" class="kg-image" alt="From Black Box to Open Source: Embracing Transparency" loading="lazy" width="2000" height="861" srcset="https://blog.bemi.io/content/images/size/w600/2024/02/2-2.webp 600w, https://blog.bemi.io/content/images/size/w1000/2024/02/2-2.webp 1000w, https://blog.bemi.io/content/images/size/w1600/2024/02/2-2.webp 1600w, https://blog.bemi.io/content/images/2024/02/2-2.webp 2000w" sizes="(min-width: 720px) 720px"><figcaption><a href="https://db-engines.com/en/ranking_osvsc?ref=blog.bemi.io"><span style="white-space: pre-wrap;">https://db-engines.com/en/ranking_osvsc</span></a></figcaption></figure><h2 id="giving-back-to-the-community">Giving back to the community</h2><p>We&#x2019;re built on top of open source giants like <a href="https://github.com/debezium/debezium?ref=blog.bemi.io" rel="noreferrer">Debezium</a> and <a href="https://github.com/nats-io/nats-server?ref=blog.bemi.io" rel="noreferrer">NATS</a>. Open sourcing is our way of reciprocating the support we&apos;ve received and giving back to the developer community.</p><p>At Bemi, we&#x2019;re developer obsessed and we want to give the best possible developer experience we can. This means no vendor lock-in and ensuring our libraries are easily accessible. This is our contribution to nurturing collaboration within the developer community. 
Who knows, maybe one day there&apos;ll be tools built on top of what we&apos;ve built.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.bemi.io/content/images/2024/02/3-1.webp" class="kg-image" alt="From Black Box to Open Source: Embracing Transparency" loading="lazy" width="2000" height="2540" srcset="https://blog.bemi.io/content/images/size/w600/2024/02/3-1.webp 600w, https://blog.bemi.io/content/images/size/w1000/2024/02/3-1.webp 1000w, https://blog.bemi.io/content/images/size/w1600/2024/02/3-1.webp 1600w, https://blog.bemi.io/content/images/2024/02/3-1.webp 2000w" sizes="(min-width: 720px) 720px"><figcaption><a href="https://xkcd.com/2347/?ref=blog.bemi.io"><span style="white-space: pre-wrap;">https://xkcd.com/2347/</span></a></figcaption></figure><h2 id="exploring-paths-in-open-source">Exploring paths in open source</h2><p>Companies embrace open source for various other reasons. Some, like <a href="https://supabase.com/?ref=blog.bemi.io" rel="noreferrer">Supabase</a>, use it as a key differentiator, positioning themselves as the open-source Firebase alternative. Others, like <a href="https://www.comma.ai/?ref=blog.bemi.io" rel="noreferrer">CommaAI</a>, use it to ignite diverse applications and innovations through encouraged repository forking.</p><p>There&#x2019;s also clearly merit to other approaches to company building as well, evident in the success on each side, such as Gitlab vs. Github or <a href="https://www.vox.com/technology/2023/7/28/23809028/ai-artificial-intelligence-open-closed-meta-mark-zuckerberg-sam-altman-open-ai?ref=blog.bemi.io" rel="noreferrer">Meta AI vs. OpenAI</a>. Open sourcing isn&apos;t a one-size-fits-all strategy, but for us, it emboldens our company vision and goals.</p><h2 id="looking-forward">Looking forward</h2><p>We want to keep focusing on building the best products with our users, and not in isolation. 
We&#x2019;re looking forward to the start of our open source journey ?, check out and&#xA0;star&#xA0;<a href="https://github.com/BemiHQ/bemi?ref=blog.bemi.io" rel="noreferrer">our GitHub</a>&#xA0;to stay in the loop on updates!</p>]]></content:encoded></item><item><title><![CDATA[Reducing Event Sourcing Complexity to Boost Product Velocity]]></title><description><![CDATA[A pragmatic approach to getting the benefits of event sourcing without hindering developer velocity.]]></description><link>https://blog.bemi.io/reducing-event-sourcing-complexity-to-boost-product-velocity/</link><guid isPermaLink="false">65d97b67e08447000140ff0a</guid><category><![CDATA[Guide]]></category><dc:creator><![CDATA[Arjun Lall]]></dc:creator><pubDate>Tue, 30 Jan 2024 12:00:00 GMT</pubDate><media:content url="https://blog.bemi.io/content/images/2024/02/_.webp" medium="image"/><content:encoded><![CDATA[<img src="https://blog.bemi.io/content/images/2024/02/_.webp" alt="Reducing Event Sourcing Complexity to Boost Product Velocity"><p>I won&#x2019;t go into great detail about why <a href="https://chriskiehl.com/article/event-sourcing-is-hard?ref=blog.bemi.io" rel="noreferrer">event sourcing is hard</a>. Generally, it represents a significant paradigm shift from the way typical CRUD-like applications are built and introduces high technical complexity. 
Especially for startups, opting for this architecture comes with a big cost since it slows down how fast developers are able to ship.</p><figure class="kg-card kg-image-card"><img src="https://blog.bemi.io/content/images/2024/02/1-1.webp" class="kg-image" alt="Reducing Event Sourcing Complexity to Boost Product Velocity" loading="lazy" width="1270" height="668" srcset="https://blog.bemi.io/content/images/size/w600/2024/02/1-1.webp 600w, https://blog.bemi.io/content/images/size/w1000/2024/02/1-1.webp 1000w, https://blog.bemi.io/content/images/2024/02/1-1.webp 1270w" sizes="(min-width: 720px) 720px"></figure><h2 id="achieving-benefits-without-slowing-down">Achieving benefits without slowing down</h2><p>At the core of event sourcing is the event log - a record of immutable facts that document every change to an application&apos;s state. Why does this matter? Because sometimes, knowing just the current app state isn&apos;t enough; <a href="https://martinfowler.com/eaaDev/EventSourcing.html?ref=blog.bemi.io" rel="noreferrer">we want to know how we got there</a>.</p><p>A pragmatic approach to getting the advantages of event sourcing is by recording an event log after data changes have already taken place. So you build your application just like developers are used to, but with a bit of extra functionality added around write operations. This way, you get the best of both worlds &#x2013; an append-only log of state changes without sacrificing product velocity! This functionality can be added within the application or at the database level.</p><h3 id="application-level-tracking">Application-level tracking</h3><p>Writing some application code to track data changes is the simplest approach, but comes with some drawbacks. 
Common libraries like <a href="https://github.com/paper-trail-gem/paper_trail?ref=blog.bemi.io" rel="noreferrer">paper_trail</a> and <a href="https://github.com/jazzband/django-simple-history?ref=blog.bemi.io" rel="noreferrer">django-simple-history</a> use callbacks to make additional inserts during write operations. Apart from introducing runtime performance overhead, this approach compromises reliability, since updates made outside the app stack aren&apos;t captured.</p><h3 id="database-level-tracking">Database-level tracking</h3><p>Tracking data history at the database layer is the most reliable approach. In PostgreSQL, this can be done with <a href="https://www.pgaudit.org/?ref=blog.bemi.io" rel="noreferrer">PGAudit</a>, <a href="https://wiki.postgresql.org/wiki/Audit_trigger?ref=blog.bemi.io" rel="noreferrer">Audit Triggers</a>, or a pattern called <a href="https://en.wikipedia.org/wiki/Change_data_capture?ref=blog.bemi.io" rel="noreferrer">Change Data Capture</a> (CDC).</p><p><strong>PGAudit</strong>: Sends detailed audit logs to the standard PostgreSQL output logs, but doesn&apos;t record events to a table.</p><p><strong>Audit Triggers</strong>: Records changes to an audit log table, but runs synchronously within each transaction, impacting the primary DB instance&apos;s performance.</p><p><strong>CDC</strong>: Recommended for scalability; it asynchronously captures data changes by plugging into Postgres <a href="https://www.postgresql.org/docs/current/wal-intro.html?ref=blog.bemi.io" rel="noreferrer">Write-Ahead Logs</a> (WAL).</p><figure class="kg-card kg-image-card"><img src="https://blog.bemi.io/content/images/2024/02/2-1.webp" class="kg-image" alt="Reducing Event Sourcing Complexity to Boost Product Velocity" loading="lazy" width="2000" height="978" srcset="https://blog.bemi.io/content/images/size/w600/2024/02/2-1.webp 600w, https://blog.bemi.io/content/images/size/w1000/2024/02/2-1.webp 1000w, https://blog.bemi.io/content/images/size/w1600/2024/02/2-1.webp 1600w, 
https://blog.bemi.io/content/images/2024/02/2-1.webp 2400w" sizes="(min-width: 720px) 720px"></figure><p>Although CDC is the generally preferred option, it still has drawbacks &#x2014; it&apos;s the hardest to implement and lacks the application context (where, who, how) behind a change. You can check out <a href="https://docs.bemi.io/?ref=blog.bemi.io#architecture-overview" rel="noreferrer">these architecture docs</a> to see how we overcame these challenges at Bemi.</p><p>Subscribe to stay posted about the next blog, where I&apos;ll explain our architecture in greater detail!</p>]]></content:encoded></item><item><title><![CDATA[Why Your Startup Needs a Reliable Source of Truth for Customer Activity]]></title><description><![CDATA[Building data change observability can save valuable engineering hours, empower operations teams, and prevent customer churn.]]></description><link>https://blog.bemi.io/why-your-startup-needs-a-reliable-source-of-truth-for-customer-activity/</link><guid isPermaLink="false">65d975b17e9b2f000176dfe4</guid><category><![CDATA[Guide]]></category><dc:creator><![CDATA[Arjun Lall]]></dc:creator><pubDate>Fri, 19 Jan 2024 12:00:00 GMT</pubDate><media:content url="https://blog.bemi.io/content/images/2024/02/3.webp" medium="image"/><content:encoded><![CDATA[<img src="https://blog.bemi.io/content/images/2024/02/3.webp" alt="Why Your Startup Needs a Reliable Source of Truth for Customer Activity"><p>In the fast-paced world of operationally heavy startups, investing in a comprehensive source of truth for all customer activity can yield unparalleled returns. Imagine saving valuable engineering hours, empowering your operations team, and preventing customer churn &#x2013; it&apos;s all within reach.</p><h2 id="why-do-you-need-this">Why Do You Need This?</h2><p>Consider Joe from Customer Success, faced with a customer inquiry about a delayed shipment.
Without a historical understanding of customer activity, Joe struggles to determine whether it&apos;s a bug on the platform or an account configuration change that the customer made. Joe pings engineering to investigate. An engineer tries to piece together the story of what happened from various logs and database records, then relays that information to Joe, who relays it to the customer.</p><p>This isn&apos;t just a hypothetical situation. In a recent conversation, a CTO recounted an incident where a customer wrongly blamed their platform for a failed email campaign, costing the customer a multimillion-dollar deal. The CTO personally dove into the investigation and discovered that their platform wasn&apos;t the culprit and had sent everything correctly. Instances like these are widespread and highlight the tangible value of tooling in this area &#x2013; not just in hours saved but also in valuable customer relationships safeguarded.</p><h2 id="the-solution-data-change-observability">The Solution: Data Change Observability</h2><p>Building a system that can reliably store and query data changes is key. Here are some options:</p><h3 id="open-source-libraries">Open Source Libraries</h3><p>Using a library like <a href="https://github.com/paper-trail-gem/paper_trail?ref=blog.bemi.io" rel="noreferrer">paper_trail</a> in Rails or <a href="https://github.com/jazzband/django-simple-history?ref=blog.bemi.io" rel="noreferrer">django-simple-history</a> in Django is the easiest approach and should cover most basic use cases. This is ideal for simplicity but may lack reliability and performance. Since they&#x2019;re installed on the application layer, you&apos;d miss out on recording edge cases like updates made via direct SQL queries. There&#x2019;s also some runtime performance overhead, since the libraries make extra database inserts in callbacks.
This shouldn&apos;t be a problem except at larger scale, where the storage cost and performance impact of ever-growing history tables can become a concern.</p><h3 id="event-sourcing">Event Sourcing</h3><p>Building an <a href="https://martinfowler.com/eaaDev/EventSourcing.html?ref=blog.bemi.io" rel="noreferrer">event-sourced</a> system is a comprehensive solution but represents a significant paradigm shift from the way CRUD-like applications are typically built. If an application isn&#x2019;t built with this architecture from the start, it can mean a large rewrite. This is likely not practical for most businesses.</p><h3 id="financial-ledger">Financial Ledger</h3><p>For fintechs, there&#x2019;s also the option of a financial <a href="https://en.wikipedia.org/wiki/Ledger?ref=blog.bemi.io" rel="noreferrer">ledger</a>. This can be built in-house to track payments and account balances, or there are countless ledger-as-a-service offerings that can be used.</p><h3 id="direct-integration-with-database-logs">Direct Integration with Database Logs</h3><p>Some companies opt for direct integration with the database for reliability, e.g. tapping PostgreSQL&#x2019;s <a href="https://www.postgresql.org/docs/current/wal-intro.html?ref=blog.bemi.io" rel="noreferrer">Write-Ahead Logs</a> to capture everything via <a href="https://en.wikipedia.org/wiki/Change_data_capture?ref=blog.bemi.io" rel="noreferrer">CDC</a> at the database level. However, the records would lack application context like the &apos;where&apos; (API endpoint, worker, etc.), &apos;who&apos; (user, cron job, etc.), and &apos;how&apos; behind a change.</p><h3 id="hybrid-approachbemi">Hybrid Approach - Bemi</h3><p><a href="https://bemi.io/use-case/ops?ref=blog.bemi.io" rel="noreferrer">Bemi</a> takes a hybrid approach, integrating with both the database and application layers. While Bemi&apos;s architecture might seem complex, it ensures zero performance penalty, 100% reliability, and an enhanced understanding of each change.
It&apos;s designed to be extremely simple for the user; in fact, we go as far as to claim &#x201C;full data history enabled in under a minute&#x201D;.</p><figure class="kg-card kg-image-card"><img src="https://blog.bemi.io/content/images/2024/02/1.webp" class="kg-image" alt="Why Your Startup Needs a Reliable Source of Truth for Customer Activity" loading="lazy" width="2000" height="1291" srcset="https://blog.bemi.io/content/images/size/w600/2024/02/1.webp 600w, https://blog.bemi.io/content/images/size/w1000/2024/02/1.webp 1000w, https://blog.bemi.io/content/images/size/w1600/2024/02/1.webp 1600w, https://blog.bemi.io/content/images/size/w2400/2024/02/1.webp 2400w" sizes="(min-width: 720px) 720px"></figure><h3 id="user-experience">User Experience</h3><p>Building a user-friendly interface is another consideration. Most use cases would be covered by an internal dashboard showcasing each customer&#x2019;s activity logs, with some basic filtering functionality so Joe can drill down on the relevant entity he&#x2019;s trying to troubleshoot. Bemi goes a step further, leveraging AI to transform complex data changes into human-readable logs, making them accessible even to non-technical users like Joe.</p><figure class="kg-card kg-image-card"><img src="https://blog.bemi.io/content/images/2024/02/2.webp" class="kg-image" alt="Why Your Startup Needs a Reliable Source of Truth for Customer Activity" loading="lazy" width="2000" height="1385" srcset="https://blog.bemi.io/content/images/size/w600/2024/02/2.webp 600w, https://blog.bemi.io/content/images/size/w1000/2024/02/2.webp 1000w, https://blog.bemi.io/content/images/size/w1600/2024/02/2.webp 1600w, https://blog.bemi.io/content/images/2024/02/2.webp 2400w" sizes="(min-width: 720px) 720px"></figure><p>The degree to which this is a problem and the ideal solution vary among companies. Please share how you&#x2019;ve solved this in the past!
If you&#x2019;re dealing with this currently, feel free to <a href="https://calendly.com/arjun-lall/30min?ref=blog.bemi.io" rel="noreferrer"><strong>schedule a chat</strong></a>, and I can share some additional tips along the way.</p><p>By building a reliable source of truth for customer activity, you&apos;re not just saving hours troubleshooting &#x2013; you&apos;re future-proofing your operations and setting the stage for sustainable growth.</p>]]></content:encoded></item></channel></rss>