In Wireshark and MongoDB 3.6, I explained that Wireshark is amazing for debugging actual network communications. But sometimes it is necessary to debug things before they get sent out onto the wire. The majority of the driver's communication with the server is through BSON documents with minimal overhead of wire protocol messages. BSON documents are represented in the C Driver by bson_t data structures. The bson_t structure wraps all of the different data types from the BSON Specification. It is analogous to PHP's zval structure, although its implementation is a little more complicated.
A bson_t structure can be allocated on the stack or heap, just like a zval structure. A zval structure represents a single data type and single value. A bson_t structure represents a buffer of bytes constituting one or more values in the form of a BSON document. This buffer is exactly what the MongoDB server expects to be transmitted over a network connection. As many BSON documents are small, the bson_t structure can function in two modes, determined by a flag: inline or allocated. In inline mode it only has space for 120 bytes of BSON data, but no memory has to be allocated on the heap. This mode can significantly speed up its creation, especially if it is allocated on the stack (by using bson_t value, instead of bson_t *value = bson_new()). It makes sense to have this mode, as many common interactions with the server fall under this 120-byte limit.
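To get a feel for how little room a typical document needs, the following Python sketch hand-encodes the document {"hello": "world"} according to the BSON specification and compares its size with the 120-byte inline limit. The helper name encode_string_document is made up for this illustration and is not part of libbson.

```python
import struct

INLINE_LIMIT = 120  # bytes of BSON data a bson_t can hold in inline mode


def encode_string_document(key: str, value: str) -> bytes:
    """Encode a one-field BSON document whose value is a UTF-8 string."""
    name = key.encode("utf-8") + b"\x00"    # field name as a C string
    text = value.encode("utf-8") + b"\x00"  # string payload, null-terminated
    # Element: type byte 0x02, name, int32 payload length, payload.
    element = b"\x02" + name + struct.pack("<i", len(text)) + text
    body = element + b"\x00"                # document terminator
    # The int32 length prefix counts itself (4 bytes) plus everything after it.
    return struct.pack("<i", 4 + len(body)) + body


doc = encode_string_document("hello", "world")
print(len(doc), len(doc) <= INLINE_LIMIT)  # prints: 22 True
```

A single short key/value pair costs 22 bytes, so a handful of such fields still fits comfortably in the inline buffer.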
For PHP's zval, the PHP developers have developed a helper function, printzv, that can be loaded into the GDB debugger. This helper function unpacks all the intricacies of the zval structure (e.g. arrays, objects) and displays them on the GDB console. When working on some code for the MongoDB Driver for PHP, I was looking for something similar for the bson_t structure, only to find that no such thing existed yet. With the bson_t structure being more complicated (two modes, data stored as a binary stream), such a helper would be just as useful as PHP's printzv GDB helper. You can guess already that, of course, I felt the need to just write one myself.
GDB supports extensions written in Python, but that functionality is sometimes disabled. It also has its own scripting language that you can use on its command line, or by loading your own files with the source command. You can define functions in the language, but the functions can't return values. There are also no classes or scoping, which means all variables are global. With the data stored in the bson_t struct as a stream of binary data, I ended up writing a GDB implementation of a streamed BSON decoder, with a lot of handicaps.
The printbson function accepts a bson_t * value, and then determines whether its mode is inline or allocated. Depending on the allocation type, printbson then delegates to a "private" __printbson function with the right parameters describing where the binary stream is stored.
__printbson prints the length of the top-level BSON document and then calls the __printelements function. This function reads data from the stream until all key/value pairs have been consumed, advancing its internal read pointer as it goes. It can detect that all elements have been read, as each BSON document ends with a null byte character (\0).
If a value contains a nested BSON document, such as the document or array types, it recursively calls __printelements, and also does some housekeeping to make sure the following output is nicely indented.
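The same walk can be sketched in Python. This is only an illustration of the idea, assuming that just the int32, string, document, and array types need handling; the names read_cstring and walk_elements are invented here and do not correspond to anything in the .gdbinit file. Unlike the GDB version, which has to rely on a global read pointer, the sketch returns the new offset from each call.

```python
import struct


def read_cstring(buf: bytes, pos: int):
    """Return the null-terminated string at pos and the offset just past it."""
    end = buf.index(b"\x00", pos)
    return buf[pos:end].decode("utf-8"), end + 1


def walk_elements(buf: bytes, pos: int, depth: int, out: list) -> int:
    """Append one line per element to out; return the offset after the document."""
    start = pos
    (length,) = struct.unpack_from("<i", buf, pos)  # int32 document length
    pos += 4
    while buf[pos] != 0:  # a 0x00 byte where a type is expected ends the document
        type_byte = buf[pos]
        name, pos = read_cstring(buf, pos + 1)
        indent = "    " * depth
        if type_byte == 0x10:  # int32
            (value,) = struct.unpack_from("<i", buf, pos)
            out.append(f"{indent}{name}: {value}")
            pos += 4
        elif type_byte == 0x02:  # string: int32 size, bytes, trailing null
            (size,) = struct.unpack_from("<i", buf, pos)
            text = buf[pos + 4 : pos + 4 + size - 1].decode("utf-8")
            out.append(f'{indent}{name}: "{text}"')
            pos += 4 + size
        elif type_byte in (0x03, 0x04):  # nested document or array: recurse
            out.append(f"{indent}{name}:")
            pos = walk_elements(buf, pos, depth + 1, out)
        else:
            raise ValueError(f"type 0x{type_byte:02x} is not handled in this sketch")
    pos += 1  # step over the terminating null byte
    if pos - start != length:
        raise ValueError("length prefix does not match the bytes consumed")
    return pos


def wrap(elements: bytes) -> bytes:
    """Add the length prefix and terminator around already-encoded elements."""
    return struct.pack("<i", 4 + len(elements) + 1) + elements + b"\x00"


# Hand-built sample: {"a": 1, "sub": {"b": "x"}}
inner = wrap(b"\x02" + b"b\x00" + struct.pack("<i", 2) + b"x\x00")
sample = wrap(b"\x10" + b"a\x00" + struct.pack("<i", 1) + b"\x03" + b"sub\x00" + inner)

lines = []
end_offset = walk_elements(sample, 0, 0, lines)
print("\n".join(lines))
```

Passing the depth down as an argument is what keeps the indentation bookkeeping trivial here; in GDB's scripting language the equivalent state has to live in global variables.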
Each element begins with a single byte indicating the field type, followed by the field name as a null-terminated string, and then a value. After the type and name are consumed, __printelements defers to a specialised print function for each type. As an example, for an ObjectID field, it has:

    if $type == 0x07
        __printObjectID $data
    end
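The __printObjectID function is then responsible for reading and displaying the value of the ObjectID. In this case, the value is 12 bytes, which we'd like to display as a hexadecimal string:

    define __printObjectID
        set $value = ((uint8_t*) $arg0)
        set $i = 0
        printf "ObjectID(\""
        # the listing is cut off after "while $i"; the rest is reconstructed
        # from the description that follows
        while $i < 12
            printf "%02x", $value[$i]
            set $i = $i + 1
        end
        printf "\")"
        set $data = $data + 12
    end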
It first assigns a value of a correctly cast type (uint8_t*) to the $value variable, and initialises the loop variable $i. It then uses a while loop to iterate over the 12 bytes; GDB does not have a for construct. At the end of each display function, the $data pointer is advanced by the number of bytes that the value reader consumed.
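For comparison, here is roughly what the same step looks like in Python, where a function can return both the formatted text and the advanced offset. The name format_object_id and the sample bytes are made up for this sketch.

```python
def format_object_id(buf: bytes, pos: int):
    """Format the 12-byte ObjectID at pos; return the text and the new offset."""
    value = buf[pos : pos + 12]
    if len(value) != 12:
        raise ValueError("buffer ends before the ObjectID is complete")
    # Two lowercase hexadecimal digits per byte, as with printf's %02x.
    digits = "".join(f"{byte:02x}" for byte in value)
    return f'ObjectID("{digits}")', pos + 12


raw = bytes(range(12)) + b"\x00"  # 12 value bytes followed by a document terminator
text, next_pos = format_object_id(raw, 0)
print(text)  # prints: ObjectID("000102030405060708090a0b")
```

After the call, next_pos points at the byte following the value, which plays the role of the advanced $data pointer.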
For types that use a null-terminated C-string, an additional loop advances $data until a \0 character is found. For example, the Regex data type is represented by two C-strings:

    define __printRegex
        printf "Regex(\"%s\", \"", (char*) $data

        # skip through C String (compare the byte $data points at, not the pointer)
        while *((char*) $data) != '\0'
            set $data = $data + 1
        end
        # step over the terminating null byte
        set $data = $data + 1

        printf "%s\")", (char*) $data

        # skip through C String
        while *((char*) $data) != '\0'
            set $data = $data + 1
        end
        set $data = $data + 1
    end
We start by printing the type name prefix and first string (pattern) using printf and then advance our data pointer with a while loop. Then, the second string (modifiers) is printed with printf and we advance again, leaving the $data pointer at the next key/value pair (or our document's trailing null byte if the regex type was the last element).
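The two-C-string layout is easy to check with a few lines of Python. The sketch below uses invented names (read_cstring, format_regex) and a hand-written payload; it only illustrates how the offset ends up on the byte after the second string.

```python
def read_cstring(buf: bytes, pos: int):
    """Return the null-terminated string at pos and the offset just past it."""
    end = buf.index(b"\x00", pos)
    return buf[pos:end].decode("utf-8"), end + 1


def format_regex(buf: bytes, pos: int):
    """Read the pattern and modifiers C-strings; return the text and new offset."""
    pattern, pos = read_cstring(buf, pos)
    modifiers, pos = read_cstring(buf, pos)
    return f'Regex("{pattern}", "{modifiers}")', pos


# Pattern "^ab+c$", modifiers "im", then the document's trailing null byte.
payload = b"^ab+c$\x00" + b"im\x00" + b"\x00"
text, next_pos = format_regex(payload, 0)
print(text)  # prints: Regex("^ab+c$", "im")
```

Here next_pos lands on the trailing null byte, matching the case where the regex is the last element of the document.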
After implementing all the different data types, I made a PR against the MongoDB C driver, where the BSON library resides. It has now been merged. In order to make use of the .gdbinit file, you can include it in your GDB session with the source command. With the file loaded, and a bson_t * variable in the local scope, you can run printbson with that variable as its argument to display the document's contents on the GDB console.
In the future, I might add information about the length of strings, or convert the predefined subtypes of the Binary data type to their common names. Happy hacking!