Reading Microsoft OneNote files

This weekend, I discovered the replacement of OneNote for Windows 10 contains a feature to back-up the notebooks to a .one file. This opens the door to possibility of reading my notes and writing a way to convert them to Markdown or reStructuredText in the not so far future.

Firstly, let us introduce the specification for the file format .one called [MS-ONESTORE]. I had previously came across the specification for the OneNote format well specially [MS-ONESTORE] back in December 2022 and stopped working on on Christmas eve. The reason for that was the files I had were in the older format as they came from Microsoft OneNote from my university laptop.

The benefit of that past attempt was it produced the beginnings of a Python module for working with files in the OneNote Revision Store File Format a.k.a the [MS-ONESTORE] format.

Introduction

If you are keen to know more about the format then read the start with the “MS-ONESTORE: OneNote Revision Store File Format” document as it will lay it out better than I can. For the project, my focus was in reading the data out and less so learning the entire format.

At the start of the file is the file header described in section 2.3.1 of the specification document MS-ONESTORE, which going forward when the specification is mentioned that is what it is referring to.

The first thing in this header is a 16-bit GUID (Globally Unique Identifier), which outside of Microsoft is more widely known as a Universally Unique Identifier (UUID). There are two possible GUID this can represent the first is for the .one format which is what my project is interested in. It has a value of {7B5C52E4-D88C-4DA7-AEB1-5378D02996D3}, where the other format is .onetoc2 which has a value of {43FF2FA1-EFD9-4C76-9EE2-10EA5722765F}.

The twenty fourth thing in this header is the second thing that is most important in order to read it which is the fcrFileNodeListRoot or expanded out is the file chunk reference (more on that soon) to the root file node list.

A file chunk reference (FCR) is composed of two integers, a offset to the location of the referenced data (called stp in the specification) in the file and a size of the referenced data (called cb in the specification). These are essentially pointers within the file to other parts of the file.

File Node List

A file node list is the basic logical structure used to organise data in the file and they contain a sequence of FileNode structures that can either contain data, references to data, and references to other file nodes. The last part essentially means the list is a tree in practice.

The list is split into fragments where a fragment ends with the location ( a file chunk reference) to the next fragment.

Back to reading the file, given the file chunk reference to the root file node list, within the file we seek to the offset and that takes us to the start of the file node list.

The following code reads the start of the node list, specially the first FileNodeListFragment which starts the node list.

def read_node_List(reader, reference: FileChunkReference):
    reader.seek(reference.offset)
    data = reader.read(reference.size)

    magic, file_node_list_id, fragment_sequence_count = struct.unpack_from(
        "<QII", data)

    if magic != 0xA4567AB1F5F7F4C4:
        raise ValueError(
            'FileNodeListHeader did not start with the magic number.')

    if file_node_list_id < 0x00000010:
        raise ValueError(
            'FileNodeListID must be equal to or greater than 0x00000010.')

    next_offset, next_size, footer = struct.unpack_from("<QIQ", data[-20:])
    if footer != 0x8BC215C38233BA4B:
        raise ValueError(
            'The FileNodeListFragment did not end with the magic number for '
            'the footer.')
    ...

The fragment_sequence_count starts at 0 for first fragment in the root file list node, and each subsequent sequence will have a different number.

This essentially just reads the start and end of the fragment and checks it what is expected based on the magic numbers.

After the fragment header (the first three components) is the first FileNode. The FileNode starts with a header.

FileNodeListFragment
- FragmentHeader [magic, file node list ID, fragment sequence count]
- FileNode
  - FileNodeHeader - this is a 32-bit value (4 bytes)
    - File Node ID - Specifies the type of the file node
    - Size of the FileNode structure in bytes.
    - Format of the offset (stp) in the potential reference in the file node (the value can be 8, 4 or 2 bytes and can be compressed or uncompressed).
    - Format of the size (cb) in the potential reference in the file node.
    - Base Type - Specifies whether the file node data field a FileNodeChunkReference (if base type is 2 then it starts with a reference).
- FileNode
- FileNode
- Padding
- FragmentFooter [reference to next fragment (offset and size), magic]

A FileChunkReference that can be compressed via the formats described in the FileNodeHeader. If the format says it is compressed then the stored value is multiplied by 8 to decompress it.

First File Node

The first file node in the file I was working with was a ObjectSpaceManifestListStartFND which is defined in section 2.5.3 of MS-ONESTORE. I knew this because the first File Node ID read was 0x00C which corresponds to the former structure. The list of file node ID values and their corresponding structures can be seen in a table in section 2.4.3.

This object is made-up of the 20-bytes ExtendedGUID structure which specifies the identity of the object space by this object space manifest list. At this stage, I have no idea what a object space really means as I haven’t tackled diving deep ino that just yet, all that is important for reading the file nodes is to know that its a ExtendedGUID from section 2.5.1

A ExtendedGUID is the 16-byte GUID plus a 4-byte unsigned integer called n.

class ExtendedGuid:
    """The combination of a GUID and an unsigned integer.

    See [MS-ONESTORE] Section 2.5.1
    """
    def __init__(self, raw_bytes):
        self.guid = uuid.UUID(bytes_le=raw_bytes[:16])
        # If the GUID is all 0 then n must be 0.
        self.n = int.from_bytes(raw_bytes[16:16+4], byteorder='little')

    def __repr__(self):
        return f"[], n={self.n}"

The representation of the file node structures in Python that I used is defining them each as a data class.

@dataclasses.dataclass
class ObjectSpaceManifestListStartFND:
    """The FileNode structure that specifies the beginning of an object space
    manifest list.

    See [MS-ONESTORE] Section 2.5.3
    """

    FILE_NODE_ID: typing.ClassVar[int] = 0x00C
    """The ID of this type of file node."""

    gosid: ExtendedGuid
    """The identity of the object space as specified in the object space
    manifest list.

    See [MS-ONESTORE] Section 2.1.4 for object space.
    """

Next Nodes

The process of the rest file nodes has been:

Create the dataclass that represents the next node encountered
Add parsing logic to read the members of the file node
Repeat

The list of nodes that my sample file encountered so far are:

DataSignatureGroupDefinitionFND
GlobalIdTableEndFNDX
GlobalIdTableEntryFNDX
GlobalIdTableStart2FND
ObjectDeclaration2RefCountFND
ObjectGroupEndFND
ObjectGroupListReferenceFND
ObjectGroupStartFND
ObjectInfoDependencyOverridesFND
ObjectSpaceManifestListReferenceFND
ObjectSpaceManifestListStartFND
ObjectSpaceManifestRootFND
ReadOnlyObjectDeclaration2RefCountFND
RevisionManifestEndFND
RevisionManifestListReferenceFND
RevisionManifestListStartFND
RevisionManifestStart6FND
RootObjectReference3FND

Child file nodes

The other thing to handle was reading the child nodes. If the base type in the FileNodeHeader has a value of 2 it means that the first thing after the header is a reference to another file node list. That file node list are the children of the node.

To read that you seek to the offset in the file chunk reference which contains a FileNodeListFragment and so you start reading them.

It should be mentioned that this had to be implemented before several of the nodes listed above were encountered.

Stopping point

The point that I stopped working on this project on the weekend was when I had read all the file node up until a FileNodeListFragment was encountered which pointed to another FileNodeListFragment (i.e. the next fragment was set).

Reflection

The approach taken of waiting to set-up a new FileNode structure when it was encountered was not a good approach. This is because it meant writing up the new class to represent the structure, adding the documentation, then writing the parsing logic for it before continuing on with the next one. It meant a lot of searching about as well to find the subsection numbers and the file node IDs.

The approach I wish I taken once I had the basics working and the first couple of FileNode structures handled was to simply go through the entire list of structures one by one and creating the classes for it. This way the members of the structures would already be present and documented and the next step would be to implement the parsing for it.

Update - 2025-06-09

Handle going to the next file node fragment. -
Completely read the root file node list (tree).

Next up in the implementation would be would be reading the page names from the sections.