Zero is not always truthy in the Rails ecosystem

In Ruby, the numeric 0 is a truthy value, meaning it evaluates as true in a boolean expression. This is unlike languages such as JavaScript and Python, where the numeric 0 is falsy (ie. evaluates as false in a boolean expression). This is also a departure from the underlying hardware that our programs run on, where a 1 bit represents true and a 0 bit represents false. We can check that 0 is truthy in an irb console by using a double bang to convert 0 to its boolean value:

~ irb
3.0.2 :001 > !!0
 => true 

In Ruby, nil and false are the only falsy values, which is an elegant part of the language’s design. This philosophy of 0 as truthy is seen in some of ActiveSupport’s widely-used methods:

3.0.2 :001 > 0.present?
 => true 
3.0.2 :002 > 0.blank?
 => false 

0.present? returns true because it is not blank. 0 is not blank because it is not logically false, and not considered to be “empty” the way " ", { }, and [] are.

However, did you know that this philosophy of “0 as truthy” doesn’t extend to all methods provided by ActiveRecord and ActiveSupport?

For each attribute on an ActiveRecord object, ActiveRecord generates a “query method” that ends with a ? character. This method checks for the presence of that attribute. For example, if we have a model named Person with an attribute called age, ActiveRecord generates a method named age? that behaves like so:

p = Person.new
p.age = nil
p.age? # returns false
p.age = 18
p.age? # returns true

How does .age? behave when age is set to 0?

p.age = 0
p.age? # returns false

This behaviour (0 returning false when being read in a query method) is well-documented in the ActiveRecord::Base documentation, and its source code in the ActiveRecord codebase is easy to follow. In some cases, this can result in more readable code – for example, if we have a BankAccount model with a balance attribute, we could check if it has a non-zero balance by calling the .balance? method. However, it’s a departure from the “0 as truthy” design philosophy, and so it’s an interesting edge case to consider in the Rails ecosystem!
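As a rough illustration, a generated query method evaluates its attribute along these lines. This is a simplified paraphrase rather than the actual ActiveRecord source (which dispatches on the attribute’s database type):

require 'active_support/core_ext/object/blank'

# Simplified paraphrase of a generated query method's logic
# (not the actual ActiveRecord source)
def query_value(value)
  case value
  when true       then true
  when false, nil then false
  else
    # numeric values count as "present" only when non-zero;
    # everything else falls back to .present?
    value.respond_to?(:zero?) ? !value.zero? : value.present?
  end
end

query_value(nil) # => false
query_value(0)   # => false
query_value(18)  # => true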


How can I retry a Sidekiq job without reporting an error?

When a Sidekiq job raises an error, Sidekiq will retry that job up to 25 times following its exponential backoff schedule (or until the job succeeds). However, there are some cases where we want to retry a job, but don’t want to report an error to whatever monitors our Sidekiq errors.1 Suppose that we have some worker that tries to read a resource from an external API, and that resource may or may not be available yet:
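A sketch of such a worker – ExternalAPIClient and process_resource are hypothetical stand-ins for whatever client and processing logic we actually have:

require 'sidekiq'

class ReadResourceFromExternalAPIWorker
  include Sidekiq::Worker

  def perform(resource_id)
    # read_resource is assumed to return nil until the resource
    # becomes available on the external API
    resource = ExternalAPIClient.read_resource(resource_id)

    process_resource(resource)
  end
end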

In this case, the most straightforward way to retry without reporting an error is to define some custom error type that we only use to trigger a retry:
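Reusing the hypothetical worker from above:

class ResourceNotYetAvailableSidekiqRetryError < StandardError; end

class ReadResourceFromExternalAPIWorker
  include Sidekiq::Worker

  def perform(resource_id)
    resource = ExternalAPIClient.read_resource(resource_id)

    # Raising triggers Sidekiq's retry mechanism; a dedicated error type
    # distinguishes this expected retry from a real failure
    raise ResourceNotYetAvailableSidekiqRetryError if resource.nil?

    process_resource(resource)
  end
end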

It’s important to use a custom error type here, since we don’t want to mask any real errors that could occur in our worker! We can then configure our error monitoring service to ignore our custom ResourceNotYetAvailableSidekiqRetryError. For example, if we’re using Sentry for our error monitoring, we can add this custom error type to config.excluded_exceptions to ignore it.
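With the sentry-ruby SDK, that configuration might look like the sketch below (the initializer path is an assumption):

# config/initializers/sentry.rb
Sentry.init do |config|
  # don't report our expected-retry error
  config.excluded_exceptions += ['ResourceNotYetAvailableSidekiqRetryError']
end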

Sidekiq will then retry our ReadResourceFromExternalAPIWorker according to its exponential backoff schedule,2 which we expect to eventually succeed when the external resource becomes available. If we have configured our error monitoring service correctly for ResourceNotYetAvailableSidekiqRetryError, we will not report any errors when we do this retry.


1: For example, a high error rate on a particular job could trigger some sort of alerting logic in a system like PagerDuty, which is not the desired behaviour for an expected retry!

2: If we don’t want to follow Sidekiq’s default exponential backoff schedule, the enterprise edition of Sidekiq allows us to define a rate limiter with a custom-defined backoff schedule.

Be precise when error handling known issues

Suppose we’re writing code that calls some external API, and this external API returns a JSON payload. When we call this API, there’s some nonzero chance that it will return invalid JSON (ie. a malformed response).1 In the case when we receive a malformed response, we want to raise an exception back up the call stack. The code may look something like so:
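A sketch of that code – the fetch_and_parse name and SomeAPIClient.call interface are assumptions for this example:

require 'json'

def fetch_and_parse(parameters_hash)
  raw_response = SomeAPIClient.call(parameters_hash)

  # JSON.parse raises JSON::ParserError on a malformed response,
  # which we let propagate up the call stack
  JSON.parse(raw_response)
end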

Suppose that we run this code in production, and while it works for the happy path where SomeAPIClient returns valid JSON, we find some issue with the external API we’re unable to fix. For example, we may find that due to a bug in the API endpoint that SomeAPIClient calls, SomeAPIClient returns the string "bar" (ie. a malformed JSON payload) for some fraction of the calls where parameters_hash has a key of :foo.

In this case, we treat this error as a known issue that we’re unable to solve, and so we want to write a workaround to gracefully handle this error. To write a workaround, we need error-handling logic that’s more specific than the default behaviour of raising a JSON::ParserError. We might be tempted to write code that looks like this2:
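Continuing the sketch from above:

class InvalidResponseForFoo < StandardError; end

def fetch_and_parse(parameters_hash)
  raw_response = SomeAPIClient.call(parameters_hash)
  JSON.parse(raw_response)
rescue JSON::ParserError
  # Workaround for the known issue: treat the malformed "bar" response
  # as its own error type
  raise InvalidResponseForFoo if raw_response == 'bar'
  raise
end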

By raising an InvalidResponseForFoo when we see the malformed response of "bar", the above code handles the known issue with more specificity than simply raising a JSON::ParserError.3 However, the above code is insufficiently specific, and it may mask unrelated bugs that are introduced into our system in the future.

For example, what would happen if SomeAPIClient started returning "bar" intermittently for every value of parameters_hash, not just requests containing a key of :foo? Our application would raise InvalidResponseForFoo errors for requests completely unrelated to the key of :foo, meaning our workaround is executed for a use case it was not designed for.

In this example, a better way to handle the error would be like so:
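Again continuing the sketch:

def fetch_and_parse(parameters_hash)
  raw_response = SomeAPIClient.call(parameters_hash)
  JSON.parse(raw_response)
rescue JSON::ParserError
  # Only apply the workaround to the known issue we originally observed:
  # a "bar" response for requests containing the :foo key
  raise InvalidResponseForFoo if parameters_hash.key?(:foo) && raw_response == 'bar'
  raise
end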

In the revised error handling code above, we explicitly check if the key :foo exists in our input parameter. This ensures that we’re only raising an InvalidResponseForFoo for the known issue we originally observed.

This example is specific to JSON, but there’s a generalized takeaway from this: if we’re writing workaround code to deal with a known issue, we should be as specific as possible when determining whether or not to execute that workaround. This prevents us from masking future bugs that may look similar to our known issue, but could be a different issue altogether.


1: Hint: we should assume this is true for every API call that we make!

2: In this example, we raise a custom error to handle this known issue. However, this isn’t the only way to handle a known issue. For example, we may want to return nil or implement some use-case specific logic.

3: A more efficient approach would be to write raise InvalidResponseForFoo if raw_response == 'bar' before we call JSON.parse(raw_response). This would allow us to only raise one error, and avoid writing an explicit rescue block for JSON::ParserError. However, I found the example to be a bit more readable with the error handling code written at the end of the method definition.

The four replica identities in PostgreSQL: a practical example

Logical replication is a PostgreSQL feature that decodes database writes from the write-ahead log into a stream of change events, enabling a consumer to subscribe to these changes and consume them in real time. A replica identity determines what information is published when an UPDATE or DELETE occurs on a row. It has four different options that can be configured: DEFAULT, FULL, NOTHING, and USING INDEX. The behaviour of these four replica identity options is detailed in the PostgreSQL documentation:

This form changes the information which is written to the write-ahead log to identify rows which are updated or deleted. This option has no effect except when logical replication is in use. DEFAULT (the default for non-system tables) records the old values of the columns of the primary key, if any. USING INDEX records the old values of the columns covered by the named index, which must be unique, not partial, not deferrable, and include only columns marked NOT NULL. FULL records the old values of all columns in the row. NOTHING records no information about the old row. (This is the default for system tables.) In all cases, no old values are logged unless at least one of the columns that would be logged differs between the old and new versions of the row.

This is a complete description of the different behaviours for the four replica identity options, but to better illustrate the behaviour, let’s look at a practical example below.

Table definition and database write

The table we’re writing to for this example, users, is defined like so1:

CREATE TABLE users (
  first_name VARCHAR(255) NOT NULL,
  last_name VARCHAR(255) NOT NULL,
  address VARCHAR(255) NOT NULL,
  age INTEGER NOT NULL,
  PRIMARY KEY (first_name, last_name)
);

Our initial state of this table starts with a single row – a user with a first_name of John, last_name of Smith, address of 123 Fake Street, and age of 30.

We’ll examine the different behaviours for our different replica identities with the following database write, which changes the user’s first name from John to James:

UPDATE users SET first_name = 'James' WHERE first_name = 'John' AND last_name = 'Smith';

With that context set, let’s look at what is actually written into our write-ahead log for each different replica identity.
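For reference, a table’s replica identity is configured with ALTER TABLE. A sketch of the statements for each option (the index named in USING INDEX must be unique, not partial, not deferrable, and cover only NOT NULL columns):

ALTER TABLE users REPLICA IDENTITY DEFAULT;
ALTER TABLE users REPLICA IDENTITY FULL;
ALTER TABLE users REPLICA IDENTITY NOTHING;

-- the index named in USING INDEX must already exist, e.g.:
CREATE UNIQUE INDEX users_first_name ON users (first_name);
ALTER TABLE users REPLICA IDENTITY USING INDEX users_first_name;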

DEFAULT replica identity

For the DEFAULT replica identity, we see the following information written to the write-ahead log:

table public.users:
  UPDATE:
    old-key:
      first_name[character varying]:'John'
      last_name[character varying]:'Smith'
    new-tuple:
      first_name[character varying]:'James'
      last_name[character varying]:'Smith'
      address[character varying]:'123 Fake Street'
      age[integer]:30

The DEFAULT replica identity records the old values of the complete primary key, even though we only wrote to one column of our primary key.

FULL replica identity

We receive the old values for every column in the row when we specify FULL:

table public.users:
  UPDATE:
    old-key:
      first_name[character varying]:'John'
      last_name[character varying]:'Smith'
      address[character varying]:'123 Fake Street'
      age[integer]:30
    new-tuple:
      first_name[character varying]:'James'
      last_name[character varying]:'Smith'
      address[character varying]:'123 Fake Street'
      age[integer]:30

NOTHING replica identity

NOTHING only provides us with the new values (ie. provides no old information):

table public.users:
  UPDATE:
    first_name[character varying]:'James'
    last_name[character varying]:'Smith'
    address[character varying]:'123 Fake Street'
    age[integer]:30

USING INDEX replica identity

The behaviour of the USING INDEX replica identity depends on what columns our index is defined on. For example, if we used the index users_age (ie. an index on the age column), we’d see the following information:

table public.users:
  UPDATE:
    first_name[character varying]:'James'
    last_name[character varying]:'Smith'
    address[character varying]:'123 Fake Street'
    age[integer]:30

We didn’t update age, so this behaviour is identical to the NOTHING identity in this case. However, what if we used the index users_first_name (ie. an index on the first_name column) instead?

table public.users:
  UPDATE:
    old-key:
      first_name[character varying]:'John'
    new-tuple:
      first_name[character varying]:'James'
      last_name[character varying]:'Smith'
      address[character varying]:'123 Fake Street'
      age[integer]:30

We see the previous information on first_name because users_first_name indexes on this column. If we defined an index called users_first_name_last_name on the first_name and last_name columns, the behaviour would be identical to a DEFAULT replica identity:

table public.users:
  UPDATE:
    old-key:
      first_name[character varying]:'John'
      last_name[character varying]:'Smith'
    new-tuple:
      first_name[character varying]:'James'
      last_name[character varying]:'Smith'
      address[character varying]:'123 Fake Street'
      age[integer]:30

Conclusion

To summarize, replica identity lets us choose which columns we want to publish the old values for when we do an UPDATE or DELETE operation:

  • DEFAULT specifies the columns of the primary key
  • FULL specifies all columns
  • NOTHING specifies no columns
  • USING INDEX specifies the columns that the named index is defined on

However, we always receive complete information about the new value of the row, regardless of replica identity.


1: Using a first name and last name as a primary key isn’t a great idea for anything but a toy data model – there are real life examples of multiple people with the same first and last name! However, it lets us define a (semi-believable) primary key that’s a composite of two columns, which is important when describing the DEFAULT replica identity.

Don’t just count dots for the Law of Demeter

The Law of Demeter is a design principle used to develop well-encapsulated software. To hide implementation details between classes, the Law of Demeter suggests the following guidelines (taken from the linked Wikipedia article):

• Each unit should have only limited knowledge about other units: only units “closely” related to the current unit.

• Each unit should only talk to its friends; don’t talk to strangers.

• Only talk to your immediate friends.

For example, consider the following classes A, B, and C:
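A sketch of such classes – an A holds a B, which holds a C, which knows the string 'foo':

class A
  attr_reader :b

  def initialize
    @b = B.new
  end
end

class B
  attr_reader :c

  def initialize
    @c = C.new
  end
end

class C
  def foo
    'foo'
  end
end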

If we wanted to retrieve the string 'foo' from an instance of A, the following code snippet would break the Law of Demeter:

a = A.new
bar = a.b.c.foo

By accessing instances of B and C via a, we’re exposing internal implementation details of the class A, and so this program has violated the Law of Demeter. A more encapsulated design defines our classes1 like so:
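Keeping with the sketch above, and (per the footnote) giving each class its own foo method:

class A
  def initialize
    @b = B.new
  end

  def foo
    @b.foo
  end
end

class B
  def initialize
    @c = C.new
  end

  def foo
    @c.foo
  end
end

class C
  def foo
    'foo'
  end
end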

With our new design, we can retrieve 'foo' from an instance of A like so:

a = A.new
bar = a.foo

The above code snippet can retrieve the string 'foo' from an instance of A without knowing the internal implementation details of A (namely, that it refers to B and C when retrieving that string).

The Law of Demeter is a valuable design principle, but for simplicity and conciseness, it’s sometimes stated as “use only one dot.” If we see multiple dots when reading a value, it suggests that we’re accessing some object’s internal implementation details. a.b.c.foo has three dots (ie. bad encapsulation), while a.foo only has one (ie. good encapsulation), so the heuristic works for the above example.

However, dot counting is just a heuristic, and one that we can’t apply blindly. For example, what if we called the following code on an instance of C from the previous code snippet?

c = C.new
bar = c.foo.upcase.reverse # has value of 'OOF'

c.foo.upcase.reverse has the same number of dots as a.b.c.foo – however, does it violate the Law of Demeter?

No, not at all. c.foo returns a string, a built-in Ruby type. .upcase and .reverse are methods on that built-in type which also return strings. There’s no encapsulation that’s being broken in this example, since we’re just doing transformations on a string.

A better way to apply the “dot counting” heuristic is to:

  • check if there are multiple dots when reading a value
  • if it’s greater than one, check how many (non built-in) different classes are being accessed after each subsequent dot
  • if that number is greater than one, there may be a violation of the Law of Demeter

Heuristics are useful, but context matters when applying them.


1: For the sake of clarity, I wrote a foo method on each class in this example. However, Ruby on Rails’ delegate feature is a more concise and idiomatic way to achieve the same functionality.

Can we write “interrobang” methods in Ruby on Rails?

By convention in Ruby on Rails applications, methods ending with a ? character return a boolean, while methods ending with a ! character may modify some state or raise an exception. These are referred to as boolean methods and bang methods:
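For example, assuming a typical ActiveRecord-backed User model:

user = User.new
user.valid? # boolean method: returns true or false, with no side effects
user.save!  # bang method: writes to the database, and raises
            # ActiveRecord::RecordInvalid if validations fail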

However, what if we wanted to write a method that modified some state and returned a boolean? For example, the SETNX command in Redis attempts to set a key, but only succeeds if the key is not currently set. It returns true or false based on whether it succeeded. If we were to write a wrapper around the SETNX command, we might try to name it something like modify_some_state!? to communicate that it may have modified some state and returned a boolean.

Does Ruby on Rails allow us to write a method name using !? as its ending? For example, what happens if we try to run this code?
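Judging by the error output below, the attempted definition would look something like this:

# app/models/foo.rb
class Foo
  def modify_some_state!?
    true
  end
end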

If we try to load Foo in the Rails console, we see the following error:

$ rails c
Running via Spring preloader in process 15120
Loading development environment (Rails 6.1.3)
2.5.1 :001 > Foo.new
Traceback (most recent call last):
        1: from (irb):1
SyntaxError (/Users/eric/some_app/app/models/foo.rb:2: syntax error, unexpected '?', expecting ';' or '\n')
    def modify_some_state!?

Unfortunately, this results in a syntax error in our application, since we can only use a ! or ? character at the end of a method name.

However, Ruby allows us to use non-ASCII characters in our method names! The interrobang (represented as ‽) is a punctuation mark that combines the exclamation and question marks. If we really wanted to, we could use the Unicode character ‽ in a Ruby method name. The following Foo definition is valid Ruby code:
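A sketch, reusing the hypothetical modify_some_state example from above:

class Foo
  def modify_some_state‽
    true
  end
end

Foo.new.modify_some_state‽ # => true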

However, this code has the potential to be extremely confusing and unmaintainable. The ‽ character looks similar to a question mark, and doesn’t appear on a standard QWERTY keyboard. So, even if the above naming convention is possible, I would not recommend actually using it when writing Ruby code 🙂

Here’s my recommendation on naming a method that returns a boolean and modifies state: if you’re ending your method with ?, use the rest of the method name to make it clear that you’re modifying some state. If you’re ending your method with !, use the rest of the method name to make it clear that you’re returning a boolean.

Returning to the SETNX wrapper example, two potential names we could use for that method are:

  • successfully_set_key?: the ? communicates that we’re returning a boolean, but the set verb also shows that we’re modifying some state.
  • set_key_if_empty!: the ! communicates that we’re modifying some state, but the conditional if implies some boolean value associated with the result of the method.

Selecting a ? or ! suffix is also dependent on how the method will be used within the program. A method used as a guard clause, for example, should end with a ?. A method whose result won’t affect the control flow, on the other hand, may be better suited to ending with a !.


Prefer .find_by to .find in ActiveRecord

ActiveRecord is the object-relational mapping (ORM) library used in Ruby on Rails. .find and .find_by are two of its methods that, given an ID, can query a record from the database for a given model.1 At first glance, .find and .find_by look pretty similar. In this blog post, however, we’ll discuss the readability benefits of .find_by over .find when we want to query by ID.

Suppose that we’re using our Rails console, and we want to read an instance of Foo that has an ID of 1 from the database. Using .find and .find_by looks like so:

2.5.1 :001 > Foo.find(1)
  Foo Load (0.1ms)  SELECT "foos".* FROM "foos" WHERE "foos"."id" = ? LIMIT ?  [["id", 1], ["LIMIT", 1]]
 => #<Foo id: 1, created_at: "2021-03-07 23:55:46.994305000 +0000", updated_at: "2021-03-07 23:55:46.994305000 +0000"> 

2.5.1 :002 > Foo.find_by(id: 1)
  Foo Load (0.1ms)  SELECT "foos".* FROM "foos" WHERE "foos"."id" = ? LIMIT ?  [["id", 1], ["LIMIT", 1]]
 => #<Foo id: 1, created_at: "2021-03-07 23:55:46.994305000 +0000", updated_at: "2021-03-07 23:55:46.994305000 +0000"> 

In .find, we pass in 1 as a positional argument, while in .find_by, we pass 1 in as a keyword argument for id.

The generated SQL is identical for .find and .find_by, and both returned the same object. However, what happens if we try to query a Foo that doesn’t exist? Suppose we don’t have a Foo with the ID of 100 in our database, and we try to query it:

2.5.1 :003 > Foo.find(100)
  Foo Load (0.1ms)  SELECT "foos".* FROM "foos" WHERE "foos"."id" = ? LIMIT ?  [["id", 100], ["LIMIT", 1]]
Traceback (most recent call last):
        1: from (irb):16
ActiveRecord::RecordNotFound (Couldn't find Foo with 'id'=100)

2.5.1 :004 > Foo.find_by(id: 100)
  Foo Load (0.1ms)  SELECT "foos".* FROM "foos" WHERE "foos"."id" = ? LIMIT ?  [["id", 100], ["LIMIT", 1]]
 => nil 

.find raises a RecordNotFound exception when a Foo with ID of 100 doesn’t exist. find_by, on the other hand, returns nil without raising an exception – if we did want to raise a RecordNotFound exception, we would call the bang method find_by!:2

2.5.1 :005 > Foo.find_by!(id: 100)
  Foo Load (0.1ms)  SELECT "foos".* FROM "foos" WHERE "foos"."id" = ? LIMIT ?  [["id", 100], ["LIMIT", 1]]
Traceback (most recent call last):
        1: from (irb):27
ActiveRecord::RecordNotFound (Couldn't find Foo)

Based on the examples above, there are two readability advantages to using .find_by, which we’ll discuss below:

First, when we use .find_by, it’s clearer that we’re querying by ID. The examples above clearly stated that 1 and 100 were IDs for our Foo records, but what if we were working with a snippet of code like so?

some_undescriptive_variable_name = 1
Foo.find(some_undescriptive_variable_name)

If we haven’t named our variables well, it may not be immediately clear to a reader that we’re querying by ID, especially if our reader is unfamiliar with Ruby on Rails. Foo.find_by(id: some_undescriptive_variable_name), on the other hand, makes this more explicit via its id keyword argument.

Second, .find_by allows us to be more explicit about whether or not we want to raise an exception. From its name alone, it’s not obvious that .find will raise an exception if it can’t find a record. On the other hand, .find_by! follows the bang method convention, so it’s clearer that we intend to raise an exception when we call it. And if we don’t want to raise an exception, we can simply call .find_by.

This preference for .find_by over .find only applies to the case where we’re querying on a single ID, since these two methods actually generate different SQL when querying multiple IDs. When we pass in an array of IDs to .find, it returns an array of the matching records, while .find_by would truncate the result to one record:

2.5.1 :006 > Foo.find([1,2])
  Foo Load (0.2ms)  SELECT "foos".* FROM "foos" WHERE "foos"."id" IN (?, ?)  [[nil, 1], [nil, 2]]
 => [#<Foo id: 1, created_at: "2021-03-07 23:55:46.994305000 +0000", updated_at: "2021-03-07 23:55:46.994305000 +0000">, #<Foo id: 2, created_at: "2021-03-08 00:10:12.849400000 +0000", updated_at: "2021-03-08 00:10:12.849400000 +0000">] 

2.5.1 :007 > Foo.find_by(id: [1,2])
  Foo Load (0.2ms)  SELECT "foos".* FROM "foos" WHERE "foos"."id" IN (?, ?) LIMIT ?  [[nil, 1], [nil, 2], ["LIMIT", 1]]
 => #<Foo id: 1, created_at: "2021-03-07 23:55:46.994305000 +0000", updated_at: "2021-03-07 23:55:46.994305000 +0000"> 

For cases where we want to query a single ActiveRecord record by ID, however, we should prefer .find_by to .find, as this produces more readable code.


1: This is a non-exhaustive set of methods. .where is another ActiveRecord method that can query by ID.

2: By convention, a bang method in Ruby is expected to modify the object it’s called on, have a state-changing side effect, or raise an exception.

Double-check your Rails migrations’ auto-generated SQL

Suppose we want to create a join table in a Ruby on Rails application between two models, Foo and Bar. There’s plenty of material available online describing join tables in detail (the Wikipedia page on associative entities is pretty good), but to begin, let’s go over a brief recap and an example:

Join tables implement many-to-many relationships between entities in a relational database. An entry in a join table will hold the primary keys of the entities it associates. For example, the linked Wikipedia article above (at time of writing) shows that we can use a Registration entity to implement a join table, and this join table holds a relationship between a Student and a Course. This Registration entity contains the primary keys of a Student and a Course.

Having seen an example, let’s create a join table between a Foo and a Bar. A join table between Foo and Bar will hold the primary key of a Foo and the primary key of a Bar (ie. the IDs of these two models). Searching through the Ruby on Rails documentation, we may find that create_join_table is a method that is available inside of a database migration. To use this method, we’d write a migration like so:
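A minimal version of that migration (the class name and Rails version are incidental):

class CreateJoinTableFoosBars < ActiveRecord::Migration[6.1]
  def change
    create_join_table :foos, :bars
  end
end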

If we run the migration, we do get a working join table in our database. However, if we look at the underlying SQL this migration generates, it might not quite match what we expect to get. In this blog post, I’m using SQLite (the default option provided by Rails), but we would see similar results when using a different database like PostgreSQL. The generated SQL is as follows:
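Approximately the following (create_join_table orders the two model names alphabetically when naming the table):

CREATE TABLE "bars_foos" (
  "foo_id" integer NOT NULL,
  "bar_id" integer NOT NULL
);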

There are two interesting things to note about the above SQL.

Firstly, our join table doesn’t actually use a foreign key constraint. This means that our SQL database will let us insert any integer into the values of foo_id and bar_id, even if such an ID doesn’t exist in our tables for Foo and Bar.

Secondly, the generated SQL has NOT NULL constraints, meaning our join table entries must always have an integer value set for foo_id and bar_id. This makes sense in some, but not all, use cases.

In the Student and Course example from the beginning of the article, NOT NULL fits the use case – it doesn’t make sense to have a Registration with a Course but no Student, or a Registration with a Student but no Course.

However, let’s look at a slightly different use case – suppose we’re trying to keep track of floating software licenses, where we limit the number of users who have permission to use some software at a given time. In this case, we may have a join table named something like LicenseAssignment between a User and a SoftwareApplication. Here, however, it makes sense for the user to be NULL – it would just mean that not all of our licenses are currently being used. In this case, we wouldn’t want a NOT NULL constraint.

As of writing, most hits on the first page of Google for “rails join table tutorial” use database migrations that implement pretty different results from what we see above. I’ve excluded database indexes for brevity, but for the most part, they look like so:
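A representative version of those tutorial migrations, with indexes omitted as mentioned:

class CreateFoosBars < ActiveRecord::Migration[6.1]
  def change
    create_table :foos_bars do |t|
      t.references :foo, foreign_key: true, index: false
      t.references :bar, foreign_key: true, index: false
    end
  end
end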

This migration generates the following SQL:
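In SQLite, roughly the following – the fk_rails_... constraint names are abbreviated here, since Rails appends a generated hash to each:

CREATE TABLE "foos_bars" (
  "id" integer PRIMARY KEY AUTOINCREMENT NOT NULL,
  "foo_id" integer,
  "bar_id" integer,
  CONSTRAINT "fk_rails_..." FOREIGN KEY ("foo_id") REFERENCES "foos" ("id"),
  CONSTRAINT "fk_rails_..." FOREIGN KEY ("bar_id") REFERENCES "bars" ("id")
);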

If we use the above SQL, we can rely on our database to enforce some level of data integrity via its foreign key constraints, and we support the use case where a Foo ID or a Bar ID can be NULL.

So, what’s the lesson behind all of this? Double-check your Rails migrations’ auto-generated SQL, and make sure it fits your assumptions and use case. Superficially, the two Rails migrations outlined in the article look similar, but their underlying data models are actually very different.


This is what causes XXE vulnerabilities

XML external entity (XXE) attacks are one of the OWASP Top Ten security risks. XXE attacks are caused by XML parsers that have entity processing enabled. Here’s an example of a simple Ruby program that has entity processing enabled in Nokogiri, its XML parser:
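A sketch of such a program, consistent with the output shown below:

# xxe.rb
require 'nokogiri'

xml = <<~XML
  <?xml version="1.0"?>
  <!DOCTYPE root [
  <!ENTITY users SYSTEM "file:///etc/passwd">
  ]>
  <root>
      <child>&users;</child>
  </root>
XML

# config.noent is the NOENT flag; parsing with it enabled substitutes
# entities like &users; with their (here, external) values
doc = Nokogiri::XML(xml) { |config| config.noent }
puts doc.to_xml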

This allows our XML parser to read the contents of our local filesystem, the key point being that this occurs because the NOENT flag is enabled. When we run the program, we see the contents of /etc/passwd (limited to the first 10 lines for brevity):

$ ruby xxe.rb | head -n 10
<?xml version="1.0"?>
<!DOCTYPE root [
<!ENTITY users SYSTEM "file:///etc/passwd">
]>
<root>
    <child>##
# User Database
# 
# Note that this file is consulted directly only when the system is running
# in single-user mode.  At other times this information is provided by

If we ran the program with NOENT disabled, we’d see the following:

$ ruby xxe.rb | head -n 10
<?xml version="1.0"?>
<!DOCTYPE root [
<!ENTITY users SYSTEM "file:///etc/passwd">
]>
<root>
    <child>&users;</child>
</root>

In this case, we see that there’s still a reference to the users entity, and we haven’t read the contents of our local filesystem.

This raises a question: what does NOENT actually mean?

At first glance, the naming is a bit counterintuitive. NOENT looks like it means something like “no entities,” but we are processing our users external entity when that flag is enabled.

Luckily, we don’t have to search far in Nokogiri’s source code to see how NOENT is used. Nokogiri is partially implemented in Java, and we can find this code snippet in its XmlDomParserContext.java source file:
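Paraphrased (a sketch, not the verbatim source), the relevant logic is along these lines:

// When the NOENT parse option is set, the "not expand entity" feature
// is switched off, and vice versa
if (options.noEnt) {
    setFeature(FEATURE_NOT_EXPAND_ENTITY, false);
} else {
    setFeature(FEATURE_NOT_EXPAND_ENTITY, true);
}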

In the same file, we find FEATURE_NOT_EXPAND_ENTITY defined like so:
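It maps to the Apache Xerces feature URI that’s discussed below:

public static final String FEATURE_NOT_EXPAND_ENTITY =
    "http://apache.org/xml/features/dom/create-entity-ref-nodes";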

To summarize what we’ve discovered so far: when NOENT is enabled, the FEATURE_NOT_EXPAND_ENTITY feature is turned off, and this is when we see our entity expanded with contents from the local filesystem. When NOENT is disabled, the FEATURE_NOT_EXPAND_ENTITY feature is turned on, and we don’t read contents from the local filesystem.

That’s a lot of consecutive negatives! Let’s reword it for clarity: when our flag is enabled, the feature which expands entities is turned on. Put this way, the behaviour is a bit clearer – we see the contents of the local filesystem because our entity-expanding feature is enabled.

Still, this doesn’t answer our original question – why the name NOENT? To answer that, we can look at Apache documentation related to the FEATURE_NOT_EXPAND_ENTITY definition shown previously. Under the definition of the http://apache.org/xml/features/dom/create-entity-ref-nodes feature, we expect the following behaviour when FEATURE_NOT_EXPAND_ENTITY is set to true:

Create EntityReference nodes in the DOM tree. The EntityReference nodes and their child nodes will be read-only.

And when it’s set to false:

Do not create EntityReference nodes in the DOM tree. No EntityReference nodes will be created, only the nodes corresponding to their fully expanded sustitution [sic] text will be created. 

In other words, when NOENT is enabled, it means that we don’t expect to see an EntityReference node in our parsed content, and our parser should replace an entity with its definition (in the case of our example, replace the users node with /etc/passwd). If NOENT is disabled, it means we do expect to see our entity in our parsed content, and so we still see a reference to users in the output of our parser.

In conclusion: the NOENT flag does mean “no entities”, as in, “no references to entities should exist in our parsed XML.” This is why our parser replaces it with the contents of /etc/passwd. This naming convention leaves plenty of room for confusion, which is why fixing the names of parser flags is actually on the Nokogiri roadmap!
