Protocol‎ > ‎Design Documents‎ > ‎

Wave store design for Wave in a Box

Authors: Soren Lassen, Joseph Gentle, Frédéric Gobry
Date: November 2010


    This document describes the design of a file-based wave store for Wave in a Box.

    BACKGROUND

    Deltas

    Deltas are Wave's primary units of information exchange and storage. A delta is a sequence of consecutive operations by a single author which apply to one wavelet. They are applied atomically and the delta has an application timestamp issued by the hosting wave server. Clients (including agents, and robots) participate in a wavelet by exchanging deltas via the wavelet's host wave server. The server applies deltas submitted by a participant and publishes the applied deltas to all participants. The state of the wavelet is completely defined by its delta history: the historical sequence of all deltas applied to the wavelet.

    A wave server stores all deltas for the wavelets that it hosts ("local wavelets") and the wavelets hosted elsewhere ("remote wavelets") that its users participate in.

    Apply and commit

    For a local wavelet, the wave server is responsible for two steps when a delta is submitted by a local user or a remote wave server:
    • First, apply the delta: perform operational transformation (OT), validity checking, timestamp, and then publish the applied delta to any local users and remote wave servers who share the wavelet.
    • Second, commit the delta: write the delta to persistent storage and then publish a commit notice to any local users and remote wave servers who share the wavelet.

    NOTE: If the delta is submitted by a local user, the wave server signs the delta before it applies and publishes it. If it is submitted by a remote wave server, the wave server verifies the signature provided by the remote wave server.
    NOTE: The federation protocol, client-server protocol, and wave server architecture would all be simpler if the two steps, delta application and persistence, were combined into one and the wave server only published an applied delta to other users and wave servers after it was persisted. The more complicated two step process was chosen to minimize the latency of propagating operations from the party (client, agent, robot) who generates them to other online parties. In particular, this is intended to enhance the immediacy of the "live typing" user experience in Wave.

    Wavelet snapshots

    The delta history is an impractical representation of the wavelet state in some contexts, because the history can be long and it can be laborious to compose the insertion and deletion operations in the deltas to compute the wavelet contents. If deltas are Wave's primary information units or data type, a useful secondary data type is that of a wavelet snapshot: a direct representation of the wavelet contents at some version (without historical information). Wavelet snapshots are used in the client-server protocol to communicate the wavelet contents to the client when it opens the wavelet. Once the wavelet is open, the client receives incremental updates in deltas.


    REQUIREMENTS

    Wave and wavelet lookups

    When a client opens a wave, it asks the server for its user's wave view: the wavelets in the wave which the user has access to. The client can specify a pattern (a set of permitted prefixes) to restrict the wave view to wavelets with matching wavelet ids. Typically, a client asks for all conversation wavelets (with ids beginning with "conv+") and the user's user-data wavelet (with id "user+<username>").

    Deltas

    For a local wavelet, the wave server can begin writing a delta to storage any time after it has been applied, as long as deltas for the same wavelet are written in version order. It is important that it knows when the delta has reached persistent storage before it sends the commit notice to announce this to clients and other wave servers, in other words we need durability: upon notification of completion of the storage transaction, which writes the delta, the delta is in permanent storage.

    For a remote wavelet, the wave server should not store deltas before they are followed by a commit notice which attests that they have been persisted by the host wave server. The commit notice specifies a wavelet version number and corresponding history hash and it communicates that the contiguous wavelet delta history up that version has been persisted.

    Search: inbox and full text search

    The wave server needs to provide search functionality to clients. As a minimum it needs to present an inbox to every user with pointers to the waves that the given user participates in and, really, we want proper full text search, so users can find waves with text search queries. A full text search index should include tokens for metadata like participants and tags. Then the inbox query for a user is a special case of full text search, namely a search for the waves with the given user as participant. Search results should be returned in "last modified time" order, most recent first. (That's how Google Wave and most email clients work. It may be interesting to offer the results sorted by some other relevance criteria, but that's not something we plan to explore anytime soon.) The wave store needs to provide something to support this functionality so that when a wave server can serve search results from waves it doesn't hold in memory.

    Reliable notifications of wavelet changes.

TODO
    Normal end user clients don't need this because they reconnect after a server crash and then learn whatever they need about the wavelet state. For federation we don't yet have reliable message delivery, so this is just one of many reasons why a message may not be delivered. In the Google Wave production systems the only component which was hooked up to the server with reliable messaging was the indexer and that'll again be the most critical piece which needs something like this in Wave in a Box.

    Concurrency

    All access to the wave store happens via the wave server. We assume that any given wave is modified only from a single wave server process. (This is definitely true in the current single-server Wave in a Box architecture and this is also how the Google Wave production service operates.) This means that the wave store is not responsible for negotiating concurrent write access to individual waves. All the consistency requirements have to do with handling failures.

    Failures

    The failures that we need to handle correctly are:
    • The failure of any individual storage operation.
    • Server crashes.
    (We should also deal with data loss from permanent storage in a sane way but that's not covered in this design document.)

    Consistency

    The key consistency property that needs to be upheld, across storage operation failures and server crashes, is the following:
    • if any party (client, agent, robot, remote wave server) receives from the wave server a commit notice for a version of a wavelet, then any of the following queries from that party to the wave server will be successful:
      • any wave view lookup which matches the wavelet (unless subsequent deltas have removed the wavelet from the wave view),
      • any wavelet delta lookup for valid versions up to the committed version.

    Platforms

    Apart from the abovementioned functional requirements, we also want a solution that works across many platforms. To begin with we want to run Wave in a Box on a single server with local disk storage. Ultimately, we want a solution which also scales well to multi-server platforms but that's a less immediate concern.

    Scale

    The initial goal is to have a solution which works on a single server and supports a modest number of users and waves. More specifically, let's aim to support the following magnitudes:
    • hundreds of users,
    • thousands of waves,
    • hundreds of wavelets per wave,
    • thousands of deltas per wavelet.
    Note that the primary use for many wavelets in a wave is when there are many local users in the system who have a userdata wavelet for a given wave, so we don't need to support more wavelets per wave than the number of users.

    Performance

    • The time it takes to write a delta to permanent storage (commit latency) is not critical because a delta is propagated through the system before it is committed and the commit latency is hidden from end users. However, a long commit latency results in a large window when a failure can occur and lose data, so it's best to keep the commit latency below 10 seconds. Occasional outliers beyond that are acceptable.
    • The throughput performance is not very important for the initial implementation.


    RELATED WORK

    TODO: Mention the AppEngine-like wave store in the Google Wave production service.
    TODO: Mention the MongoDB and JDBC FedOne persistence patches.


    DESIGN

    The following design is based on file system directories and files rather than a higher-level database system. This choice was made because file systems have well-understood properties, including durability, and are supported by all platforms and the standard Java libraries. Note that this file system based design is less scalable than a database system like MongoDB and therefore we expect to revisit the design later to scale it better. The design is centered around the storage of deltas and the lookup of waves and wavelets. Everything else in the wave store, including wavelet snapshots and indexes, can be derived from the deltas.

    Wave store directory

    The wave store lives in a directory in a file system. The location of this "wave store root" directory should be a configuration option. Under the wave store root directory, there is a "waves" subdirectory, under which all wave data is stored.

    NOTE: If we implement user indexes as files, these can be stored in another "indexes" subdirectory under the wave store root directory.
    NOTE: We could store a version file in the wave store root directory, which the wave server can read at startup to determine the data format of the wave store.

    Id to directory/file name mapping

    Wave ids and wavelet ids can be represented as unicode strings:
    http://wave-protocol.googlecode.com/hg/spec/waveid/waveidspec.html#rfc.section.4.2
    We will use a one-to-one mapping from these ids to directory and file names that are valid in all file systems. To facilitate diagnostics, we choose a reversible mapping, so you can map a directory/file name back to its wave/wavelet id. The mapping we will use takes the UTF8 byte string representation of the id unicode string and encodes it as a lower case hexidecimal ASCII string.
    EXAMPLE: The wave id foo.com/w+j1_Q is mapped to the directory name 666f6f2e636f6d2f772b6a315f51. See also the example below of a directory and file layout of a wave store.

    Wave directory

    Each wave is represented as a subdirectory of the wave store's "waves" directory. The wave directory contains files with the wavelets in the wave. The directory name is derived from the wave id using our id to directory/file name mapping. If a wave directory is empty it is redundant and the wave store runtime can choose to remove it.

    NOTE: File systems limit the number of files or subdirectories per directory. For instance, ext3 has a limit of 32K. That should suffice for representing all the wavelets in a wave in a flat wave directory, if we expect at most hundreds user data wavelets per wave. A bigger problem is to have a subdirectory per wave in a single "waves" directory. We may need to have a deeper directory structure with, say, 1000 subdirectories under "waves" and then spread the actual waves across these 1000 top-level directories by a hash of their wave ids. For simplicity, the initial implementation will not do any of this.

    Deltas file

    A wavelet is stored as a sequence of delta records in an append-only file.
    • The file name is derived from the wavelet id using our id to directory/file name mapping plus the file extension ".deltas".
    • The deltas file begins with an identifier ("WAVE") followed by a 4 bytes big endian file format version number field. The version number for the format described here is 1.
    • After the initial version field the rest of the file contains back-to-back delta records. Each delta record has the following format:

    In ASCII art, this looks like:

    ++-------------------++------------------------------------------------------+-----------------------------++----
    || File header || Record header | Record body || ...
      ++-------------------++--------------+------------------+--------------------+-------------+---------------++----
    ||WAVE| file version || rec. version | appl. delta size | transf. delta size | appl. delta | transf. delta || ...
      ++----+--------------++--------------+------------------+--------------------+-------------+---------------++----
      0 4 8 12 16 20

    NOTE: A partially written delta record at the end of the
    deltas file can be detected, namely when the total length of the last record is smaller than either the fixed length header size or the total length which can be calculated from the fields in the fixed length header. In that case we can discard the partially written delta. This is safe because we may assume that the aborted write never succeeded and thus never caused a commit notice to be sent. (Do we risk that a partial write causes the file length to increment to include the length of the record to be written even if the actual write fails? In that case, we need to store a content hash with the record to detect a partial write. We could potentially co-opt the history hash from the last delta's resultingVersion.)

    Derived information: wavelet snapshots and indexes

    Deltas index: For efficient random access to deltas in a deltas file, it is useful to maintain an index from version numbers to byte offsets delta records in the deltas file. This can be done in several ways, including:

    • Scan the contents of the deltas file and building an in-memory index when the deltas file is first accessed. This is simple but will add significant latency to first-time access to wavelets.
    • Maintain an append-only index file where additional index information into the deltas file is added as the deltas file grows. If this file is written to after the deltas file is updated, we can maintain the consistency guarantee that any information in the index file is valid but that it may trail the contents of the deltas file. The wave server can detect any slack when it access the deltas and index files for a wavelet and then add to the index file to catch up. As for the deltas file, the wave server can also repair any incomplete write at the end of the index file, provided that the index record format that permit the detection of an incomplete write.
      • The initial implementation builds such an index file per wavelet. The file name is the same as the deltas file but with the file extension ".index".
        Its contents is a index file format version number followed by a sequence of 8-byte big endian numbers, where the Nth number is the deltas file byte offset of the delta record which begins with version N, or ~offset (1-complement of the offset) if N falls between delta boundaries. This makes it easy to find a delta given its end version as well, by looking at the previous record.

    NOTE: The advantage of this specific index file format is that makes it possible to lookup an arbitrary delta by version with a single, short read operation from the index file. The disadvantage is that it is not particularly compact. A more compact file, e.g., one that indexes only every hundredth version number, would be easier to load into memory when a wavelet is accessed, but will require more work to find an individual delta record in the deltas file.

    ASCII art version of the index file. Offset(x) is the file offset in the delta file of the delta which was applied at version x.

    Delta 1 contains 3 operations
    +------------+------------+------------+------------+------------+----
    | Offset(0) | Offset(1) | ~Offset(1) | ~Offset(1) | Offset(4) | ...
    +------------+------------+------------+------------+------------+----
    0 8 16 24 32 40


    Wavelet snapshots: The initial implementation builds a snapshot from the deltas file when a wavelet is first accessed. This is simple but adds
    significant latency to first-time access to wavelets.
TODO: Store wavelet snapshots in permanent storage.

Inbox and text search: We haven't decided on the initial implementation of search. A simple file-based implementation could provide answer inbox queries by storing a per-user inbox file with pointers to all waves that a given user participates in.
  • One way to do it is to make the inbox file append-only and append something an index record whenever the user is added to or removed from a wave. If the inbox file is written to after deltas are written to deltas files, then we can maintain the consistency guarantee that any wave pointer in the index file is valid for some point in a wave's version history but may trail the contents of the wave's deltas files. (TODO: Work out the format of the inbox file and what to do about incomplete writes and stale inbox files in case of failures.)
  • Alternatively, if the wave server maintains an inbox in memory for every user, then the wave server can regularly write the entire inbox for each user to disk. Again the inbox would trail changes to the underlying waves. (TODO: Work out details.)
Comments