Authors: Soren Lassen, Joseph Gentle, Frédéric Gobry Date: November 2010
BACKGROUNDDeltasDeltas are Wave's primary units of information exchange and storage. A delta is a sequence of consecutive operations by a single author which apply to one wavelet. They are applied atomically and the delta has an application timestamp issued by the hosting wave server. Clients (including agents, and robots) participate in a wavelet by exchanging deltas via the wavelet's host wave server. The server applies deltas submitted by a participant and publishes the applied deltas to all participants. The state of the wavelet is completely defined by its delta history: the historical sequence of all deltas applied to the wavelet.A wave server stores all deltas for the wavelets that it hosts ("local wavelets") and the wavelets hosted elsewhere ("remote wavelets") that its users participate in. Apply and commitFor a local wavelet, the wave server is responsible for two steps when a delta is submitted by a local user or a remote wave server:
NOTE: The federation protocol, client-server protocol, and wave server architecture would all be simpler if the two steps, delta application and persistence, were combined into one and the wave server only published an applied delta to other users and wave servers after it was persisted. The more complicated two step process was chosen to minimize the latency of propagating operations from the party (client, agent, robot) who generates them to other online parties. In particular, this is intended to enhance the immediacy of the "live typing" user experience in Wave. Wavelet snapshotsThe delta history is an impractical representation of the wavelet state in some contexts, because the history can be long and it can be laborious to compose the insertion and deletion operations in the deltas to compute the wavelet contents. If deltas are Wave's primary information units or data type, a useful secondary data type is that of a wavelet snapshot: a direct representation of the wavelet contents at some version (without historical information). Wavelet snapshots are used in the client-server protocol to communicate the wavelet contents to the client when it opens the wavelet. Once the wavelet is open, the client receives incremental updates in deltas.REQUIREMENTSWave and wavelet lookupsWhen a client opens a wave, it asks the server for its user's wave view: the wavelets in the wave which the user has access to. The client can specify a pattern (a set of permitted prefixes) to restrict the wave view to wavelets with matching wavelet ids. Typically, a client asks for all conversation wavelets (with ids beginning with "conv+") and the user's user-data wavelet (with id "user+<username>").DeltasFor a local wavelet, the wave server can begin writing a delta to storage any time after it has been applied, as long as deltas for the same wavelet are written in version order. It is important that it knows when the delta has reached persistent storage before it sends the commit notice to announce this to clients and other wave servers, in other words we need durability: upon notification of completion of the storage transaction, which writes the delta, the delta is in permanent storage.For a remote wavelet, the wave server should not store deltas before they are followed by a commit notice which attests that they have been persisted by the host wave server. The commit notice specifies a wavelet version number and corresponding history hash and it communicates that the contiguous wavelet delta history up that version has been persisted.
http://groups.google.com/group/wave-protocol/msg/786f64cdf3d67953 Its implementation would require durable storage of remote wavelets.
TODO
ConcurrencyAll access to the wave store happens via the wave server. We assume that any given wave is modified only from a single wave server process. (This is definitely true in the current single-server Wave in a Box architecture and this is also how the Google Wave production service operates.) This means that the wave store is not responsible for negotiating concurrent write access to individual waves. All the consistency requirements have to do with handling failures.FailuresThe failures that we need to handle correctly are:ConsistencyThe key consistency property that needs to be upheld, across storage operation failures and server crashes, is the following:PlatformsApart from the abovementioned functional requirements, we also want a solution that works across many platforms. To begin with we want to run Wave in a Box on a single server with local disk storage. Ultimately, we want a solution which also scales well to multi-server platforms but that's a less immediate concern.ScaleThe initial goal is to have a solution which works on a single server and supports a modest number of users and waves. More specifically, let's aim to support the following magnitudes:PerformanceRELATED WORKTODO: Mention the AppEngine-like wave store in the Google Wave production service. TODO: Mention the MongoDB and JDBC FedOne persistence patches. DESIGNThe following design is based on file system directories and files rather than a higher-level database system. This choice was made because file systems have well-understood properties, including durability, and are supported by all platforms and the standard Java libraries. Note that this file system based design is less scalable than a database system like MongoDB and therefore we expect to revisit the design later to scale it better. The design is centered around the storage of deltas and the lookup of waves and wavelets. Everything else in the wave store, including wavelet snapshots and indexes, can be derived from the deltas.Wave store directoryThe wave store lives in a directory in a file system. The location of this "wave store root" directory should be a configuration option. Under the wave store root directory, there is a "waves" subdirectory, under which all wave data is stored.NOTE: If we implement user indexes as files, these can be stored in another "indexes" subdirectory under the wave store root directory. NOTE: We could store a version file in the wave store root directory, which the wave server can read at startup to determine the data format of the wave store. Id to directory/file name mappingWave ids and wavelet ids can be represented as unicode strings:http://wave-protocol.googlecode.com/hg/spec/waveid/waveidspec.html#rfc.section.4.2 We will use a one-to-one mapping from these ids to directory and file names that are valid in all file systems. To facilitate diagnostics, we choose a reversible mapping, so you can map a directory/file name back to its wave/wavelet id. The mapping we will use takes the UTF8 byte string representation of the id unicode string and encodes it as a lower case hexidecimal ASCII string.
Wave directoryEach wave is represented as a subdirectory of the wave store's "waves" directory. The wave directory contains files with the wavelets in the wave. The directory name is derived from the wave id using our id to directory/file name mapping. If a wave directory is empty it is redundant and the wave store runtime can choose to remove it.
NOTE: File systems limit the number of files or subdirectories per directory. For instance, ext3 has a limit of 32K. That should suffice for representing all the wavelets in a wave in a flat wave directory, if we expect at most hundreds user data wavelets per wave. A bigger problem is to have a subdirectory per wave in a single "waves" directory. We may need to have a deeper directory structure with, say, 1000 subdirectories under "waves" and then spread the actual waves across these 1000 top-level directories by a hash of their wave ids. For simplicity, the initial implementation will not do any of this. Deltas fileA wavelet is stored as a sequence of delta records in an append-only file.In ASCII art, this looks like: ++-------------------++------------------------------------------------------+-----------------------------++---- || File header || Record header | Record body || ... ++-------------------++--------------+------------------+--------------------+-------------+---------------++---- ||WAVE| file version || rec. version | appl. delta size | transf. delta size | appl. delta | transf. delta || ... ++----+--------------++--------------+------------------+--------------------+-------------+---------------++---- 0 4 8 12 16 20
Deltas index: For efficient random access to deltas in a deltas file, it is useful to maintain an index from version numbers to byte offsets delta records in the deltas file. This can be done in several ways, including:
NOTE: The advantage of this specific index file format is that makes it possible to lookup an arbitrary delta by version with a single, short read operation from the index file. The disadvantage is that it is not particularly compact. A more compact file, e.g., one that indexes only every hundredth version number, would be easier to load into memory when a wavelet is accessed, but will require more work to find an individual delta record in the deltas file. ASCII art version of the index file. Offset(x) is the file offset in the delta file of the delta which was applied at version x. Delta 1 contains 3 operations +------------+------------+------------+------------+------------+---- | Offset(0) | Offset(1) | ~Offset(1) | ~Offset(1) | Offset(4) | ... +------------+------------+------------+------------+------------+---- 0 8 16 24 32 40 TODO: Store wavelet snapshots in permanent storage. Inbox and text search: We haven't decided on the initial implementation of search. A simple file-based implementation could provide answer inbox queries by storing a per-user inbox file with pointers to all waves that a given user participates in.
EXAMPLEThe following is an example of the directory layout and files of two waves, "foo.com/w+j1_Q" with a local root wavelet and a single userdata wavelet, and "bar.com/w+9-uY" with a remote root wavelet, a local private reply, and two userdata wavelets. In the example foo.com is a local domain and bar.com is not. wave_store_root/ | |- waves/ | | | |- 666f6f2e636f6d2f772b6a315f51/ # foo.com/w+j1_Q | | | | | |- 666f6f2e636f6d2f636f6e762b726f6f74.deltas # foo.com/conv+root (conversation) | | |- 666f6f2e636f6d2f636f6e762b726f6f74.index | | |- 666f6f2e636f6d2f757365722b616c40666f6f2e636f6d.deltas # foo.com/user+al@foo.com (userdata) | | |- 666f6f2e636f6d2f757365722b616c40666f6f2e636f6d.index | | | |- 666f6f2e636f6d2f772b6a315f51/ # bar.com/w+9-uY | | | | | |- 6261722e636f6d2f636f6e762b726f6f74.deltas # bar.com/conv+root (conversation) | | |- 6261722e636f6d2f636f6e762b726f6f74.index | | |- 666f6f2e636f6d2f636f6e762b51396d6a.deltas # foo.com/conv+Q9mj (private reply) | | |- 666f6f2e636f6d2f636f6e762b51396d6a.index | | |- 666f6f2e636f6d2f757365722b616c40666f6f2e636f6d.deltas # foo.com/user+al@foo.com (userdata) | | |- 666f6f2e636f6d2f757365722b616c40666f6f2e636f6d.index | | |- 666f6f2e636f6d2f757365722b626f40666f6f2e636f6d.deltas # foo.com/user+bo@foo.com (userdata) | | |- 666f6f2e636f6d2f757365722b626f40666f6f2e636f6d.index | | | APITODO: Describe DeltaStore and WaveletStore interfacesSOURCE CODEThe implementation is in progress but nothing is operational yet.DeltaStore interface and in-memory and file-based implementations are here: WaveletStore interface and simplistic implementation here: Much remains, including unittests and hookup with the other wave server logic. AcknowledgementsAlex North gave feedback on an early version of this document. |