Reading Microsoft OneNote files
This weekend, I discovered that the replacement for OneNote for Windows 10
contains a feature to back up notebooks to a .one file. This opens the
door to the possibility of reading my notes and writing a way to convert them to
Markdown or reStructuredText in the not-too-distant future.
First, let us introduce the specification for the .one file format, called
[MS-ONESTORE]. I had previously come across the specifications for
the OneNote formats, specifically [MS-ONESTORE], back in December 2022, and
stopped working on it on Christmas Eve. The reason was that the files I had
were in the older format, as they came from Microsoft OneNote on my university
laptop.
The benefit of that past attempt was that it produced the beginnings of a Python
module for working with files in the OneNote Revision Store File Format, a.k.a.
the [MS-ONESTORE] format.
Introduction
If you are keen to know more about the format, start with the “MS-ONESTORE: OneNote Revision Store File Format” document, as it will lay it out better than I can. For this project, my focus was on reading the data out rather than learning the entire format.
At the start of the file is the file header, described in section 2.3.1 of the MS-ONESTORE specification document. From here on, “the specification” refers to that document.
The first thing in this header is a 16-byte GUID (Globally Unique Identifier),
which outside of Microsoft is more widely known as a Universally Unique
Identifier (UUID). It can be one of two GUIDs. The first is
for the .one format, which is what my project is interested in; it has a
value of {7B5C52E4-D88C-4DA7-AEB1-5378D02996D3}. The other is for the
.onetoc2 format, which has a value of {43FF2FA1-EFD9-4C76-9EE2-10EA5722765F}.
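As a minimal sketch of that check (not the code from my module), the first 16 bytes can be compared against the two GUID values quoted above. The function name `identify_file_type` and the constant names are hypothetical, chosen for illustration.

```python
import uuid

# GUID values quoted above for the first field of the header.
ONE_FILE_TYPE = uuid.UUID("7B5C52E4-D88C-4DA7-AEB1-5378D02996D3")
ONETOC2_FILE_TYPE = uuid.UUID("43FF2FA1-EFD9-4C76-9EE2-10EA5722765F")


def identify_file_type(header_bytes: bytes) -> str:
    """Return '.one' or '.onetoc2' based on the first 16 bytes of the file."""
    if len(header_bytes) < 16:
        raise ValueError("Need at least 16 bytes to read the file type GUID.")
    # GUIDs are stored with their first three fields little-endian, which is
    # the layout that the bytes_le argument expects.
    file_type = uuid.UUID(bytes_le=header_bytes[:16])
    if file_type == ONE_FILE_TYPE:
        return ".one"
    if file_type == ONETOC2_FILE_TYPE:
        return ".onetoc2"
    raise ValueError(f"Unrecognised file type GUID: {{{file_type}}}")
```

With a real file this would be called as `identify_file_type(reader.read(16))` on a file opened in binary mode.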
The twenty-fourth thing in this header is the second most important item
for reading the file: fcrFileNodeListRoot, which expanded out is the
file chunk reference (more on that soon) to the root file node list.
A file chunk reference (FCR) is composed of two integers: an offset to the location
of the referenced data in the file (called stp in the specification) and the
size of the referenced data (called cb in the specification). These are
essentially pointers within the file to other parts of the file.
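A file chunk reference could be modelled roughly as below. This is a sketch that assumes the 12-byte layout (an 8-byte offset followed by a 4-byte size, both little-endian), which is the same layout the fragment footer is unpacked with later in this post. The method name `from_bytes_64x32` is hypothetical.

```python
import dataclasses
import struct


@dataclasses.dataclass(frozen=True)
class FileChunkReference:
    """An offset (stp) and size (cb) pointing at another part of the file."""

    offset: int
    size: int

    @classmethod
    def from_bytes_64x32(cls, raw: bytes) -> "FileChunkReference":
        # "<QI": little-endian, 8-byte unsigned stp then 4-byte unsigned cb.
        offset, size = struct.unpack_from("<QI", raw)
        return cls(offset, size)
```

Keeping the two integers together in one small immutable object makes it harder to accidentally pair an offset with the wrong size.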
File Node List
A file node list is the basic logical structure used to organise data in the file. It contains a sequence of FileNode structures, each of which can contain data, references to data, or references to other file node lists. The last part essentially means the list is a tree in practice.
The list is split into fragments, where each fragment ends with the location (a file chunk reference) of the next fragment.
Back to reading the file: given the file chunk reference to the root file node list, we seek to its offset within the file, which takes us to the start of the file node list.
The following code reads the start of the node list, specifically the first
FileNodeListFragment, which starts the node list.
    import struct

    # FileChunkReference is the module's class holding an offset and a size.
    def read_node_list(reader, reference: "FileChunkReference"):
        # Read the whole fragment that the file chunk reference points at.
        reader.seek(reference.offset)
        data = reader.read(reference.size)
        # Header: magic (8 bytes), file node list ID (4), fragment sequence (4).
        magic, file_node_list_id, fragment_sequence_count = struct.unpack_from(
            "<QII", data)
        if magic != 0xA4567AB1F5F7F4C4:
            raise ValueError(
                'FileNodeListHeader did not start with the magic number.')
        if file_node_list_id < 0x00000010:
            raise ValueError(
                'FileNodeListID must be equal to or greater than 0x00000010.')
        # Footer: next fragment offset (8 bytes) and size (4), then magic (8).
        next_offset, next_size, footer = struct.unpack_from("<QIQ", data[-20:])
        if footer != 0x8BC215C38233BA4B:
            raise ValueError(
                'The FileNodeListFragment did not end with the magic number for '
                'the footer.')
        ...
The fragment_sequence_count starts at 0 for the first fragment in the root file
node list, and each subsequent fragment is numbered sequentially from there.
This code essentially just reads the start and end of the fragment and checks that they match what is expected based on the magic numbers.
After the fragment header (the first three components) is the first FileNode.
The FileNode starts with a header.
- FileNodeListFragment
  - FragmentHeader [magic, file node list ID, fragment sequence count]
  - FileNode
    - FileNodeHeader - this is a 32-bit value (4 bytes)
      - File Node ID - Specifies the type of the file node.
      - Size of the FileNode structure in bytes.
      - Format of the offset (stp) in the potential reference in the file node (the value can be 8, 4 or 2 bytes and can be compressed or uncompressed).
      - Format of the size (cb) in the potential reference in the file node.
      - Base Type - Specifies whether the file node data starts with a FileNodeChunkReference (base type 1 is a reference to data, base type 2 is a reference to a file node list).
  - FileNode
  - FileNode
  - Padding
  - FragmentFooter [reference to next fragment (offset and size), magic]
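The 32-bit FileNodeHeader outlined above can be unpacked with shifts and masks. This is a sketch rather than the code from my module: the bit widths (10, 13, 2, 2 and 4 bits, with 1 reserved bit at the top) are my reading of section 2.4.3 and should be checked against the specification, and the names `FileNodeHeader` and `parse_file_node_header` are hypothetical.

```python
import dataclasses
import struct


@dataclasses.dataclass(frozen=True)
class FileNodeHeader:
    file_node_id: int  # 10 bits: the type of the file node
    size: int          # 13 bits: size of the whole FileNode in bytes
    stp_format: int    # 2 bits: format of the offset in the reference
    cb_format: int     # 2 bits: format of the size in the reference
    base_type: int     # 4 bits: whether the data starts with a reference


def parse_file_node_header(data: bytes, position: int = 0) -> FileNodeHeader:
    """Decode the 4-byte header found at the start of every FileNode."""
    # The header is one little-endian 32-bit integer; fields are packed
    # starting from the least significant bit (assumed layout, see above).
    (value,) = struct.unpack_from("<I", data, position)
    return FileNodeHeader(
        file_node_id=value & 0x3FF,
        size=(value >> 10) & 0x1FFF,
        stp_format=(value >> 23) & 0x3,
        cb_format=(value >> 25) & 0x3,
        base_type=(value >> 27) & 0xF,
    )
```

Because the size covers the whole FileNode including its header, the next FileNode in the fragment starts at `position + header.size`.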
The reference inside a file node is a FileNodeChunkReference, which is a
FileChunkReference that can be compressed via the formats described in the
FileNodeHeader. If the format says it is compressed then the stored value is
multiplied by 8 to decompress it.
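Decoding such a reference might look like the sketch below. The two lookup tables, which map each format value to a field width and a compressed flag, are my reading of the specification rather than something confirmed in this post, so treat them as assumptions and verify them before relying on them. The function name is hypothetical.

```python
import struct

# Format value -> (struct format character, is_compressed).
# Assumption: these mappings should be checked against the specification.
STP_FORMATS = {0: ("Q", False), 1: ("I", False), 2: ("H", True), 3: ("I", True)}
CB_FORMATS = {0: ("I", False), 1: ("Q", False), 2: ("B", True), 3: ("H", True)}


def read_file_node_chunk_reference(data, position, stp_format, cb_format):
    """Return (offset, size, bytes_consumed) for the reference at position."""
    stp_char, stp_compressed = STP_FORMATS[stp_format]
    cb_char, cb_compressed = CB_FORMATS[cb_format]
    layout = "<" + stp_char + cb_char
    stp, cb = struct.unpack_from(layout, data, position)
    # A compressed value is stored divided by 8, so multiply to restore it.
    if stp_compressed:
        stp *= 8
    if cb_compressed:
        cb *= 8
    return stp, cb, struct.calcsize(layout)
```

Returning the number of bytes consumed lets the caller know where the rest of the file node data begins.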
First File Node
The first file node in the file I was working with was an
ObjectSpaceManifestListStartFND, which is defined in section 2.5.3 of
MS-ONESTORE. I knew this because the first File Node ID read was 0x00C, which
corresponds to that structure. The list of file node ID values and their
corresponding structures can be seen in a table in section 2.4.3.
This structure is made up of a 20-byte ExtendedGUID, which specifies
the identity of the object space being specified by this object space manifest list.
At this stage, I have no idea what an object space really means as I haven’t
dived deep into that just yet. All that is important for reading the
file nodes is to know that it is an ExtendedGUID from section 2.2.1.
An ExtendedGUID is a 16-byte GUID plus a 4-byte unsigned integer called
n.
    import uuid

    class ExtendedGuid:
        """The combination of a GUID and an unsigned integer.

        See [MS-ONESTORE] Section 2.2.1
        """

        def __init__(self, raw_bytes):
            # The first 16 bytes are the GUID, stored in little-endian form.
            self.guid = uuid.UUID(bytes_le=raw_bytes[:16])
            # If the GUID is all 0 then n must be 0.
            self.n = int.from_bytes(raw_bytes[16:16+4], byteorder='little')

        def __repr__(self):
            return f"{{{self.guid}}}, n={self.n}"
To represent the file node structures in Python, I defined each of them as a data class.
    import dataclasses
    import typing

    @dataclasses.dataclass
    class ObjectSpaceManifestListStartFND:
        """The FileNode structure that specifies the beginning of an object space
        manifest list.

        See [MS-ONESTORE] Section 2.5.3
        """

        FILE_NODE_ID: typing.ClassVar[int] = 0x00C
        """The ID of this type of file node."""

        gosid: ExtendedGuid
        """The identity of the object space as specified in the object space
        manifest list.

        See [MS-ONESTORE] Section 2.1.4 for object space.
        """
Next Nodes
The process for the rest of the file nodes has been:
- Create the dataclass that represents the next node encountered
- Add parsing logic to read the members of the file node
- Repeat
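One way to organise that loop is a registry that maps each file node ID to its data class, so that an unhandled ID fails loudly and points at the next class to write. This is a sketch of the idea, not the structure of my module: the `file_node` decorator, the `FILE_NODE_TYPES` dictionary, the `parse` class method and the `parse_file_node` function are all hypothetical names, and a plain tuple stands in for the ExtendedGuid class to keep the example self-contained.

```python
import dataclasses
import typing
import uuid

# Maps a file node ID to the class that represents it.
FILE_NODE_TYPES: typing.Dict[int, type] = {}


def file_node(cls):
    """Class decorator that registers a file node class by its FILE_NODE_ID."""
    FILE_NODE_TYPES[cls.FILE_NODE_ID] = cls
    return cls


@file_node
@dataclasses.dataclass
class ObjectSpaceManifestListStartFND:
    FILE_NODE_ID: typing.ClassVar[int] = 0x00C
    gosid: tuple  # (GUID, n); stands in for the ExtendedGuid class here

    @classmethod
    def parse(cls, payload: bytes):
        # The payload is the file node data that follows the 4-byte header.
        guid = uuid.UUID(bytes_le=payload[:16])
        n = int.from_bytes(payload[16:20], byteorder="little")
        return cls(gosid=(guid, n))


def parse_file_node(file_node_id: int, payload: bytes):
    """Build the registered class for file_node_id from its payload."""
    try:
        node_type = FILE_NODE_TYPES[file_node_id]
    except KeyError:
        raise NotImplementedError(
            f"No class for file node ID {file_node_id:#05x} yet.") from None
    return node_type.parse(payload)
```

Each new node type then only needs its own decorated class, and the dispatch code stays unchanged.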
The file node types encountered in my sample file so far are:
- DataSignatureGroupDefinitionFND
- GlobalIdTableEndFNDX
- GlobalIdTableEntryFNDX
- GlobalIdTableStart2FND
- ObjectDeclaration2RefCountFND
- ObjectGroupEndFND
- ObjectGroupListReferenceFND
- ObjectGroupStartFND
- ObjectInfoDependencyOverridesFND
- ObjectSpaceManifestListReferenceFND
- ObjectSpaceManifestListStartFND
- ObjectSpaceManifestRootFND
- ReadOnlyObjectDeclaration2RefCountFND
- RevisionManifestEndFND
- RevisionManifestListReferenceFND
- RevisionManifestListStartFND
- RevisionManifestStart6FND
- RootObjectReference3FND
Child file nodes
The other thing to handle was reading the child nodes.
If the base type in the FileNodeHeader has a value of 2, it means that the
first thing after the header is a reference to another file node list.
That file node list holds the children of the node.
To read them, you seek to the offset in the file chunk reference, where there is
a FileNodeListFragment, and start reading its file nodes.
It should be mentioned that this had to be implemented before several of the nodes listed above were encountered.
Stopping point
The point at which I stopped working on this project over the weekend was when I had
read all the file nodes up until a FileNodeListFragment was encountered which
pointed to another FileNodeListFragment (i.e. the next fragment was set).
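Following the chain of fragments could look something like the sketch below. It reuses the footer layout and magic number from the fragment-reading code earlier in this post. The end-of-chain test is an assumption on my part: it treats an offset of all ones, or a size of zero, as "no next fragment", which should be verified against the specification. The name `iter_fragments` is hypothetical.

```python
import struct

FOOTER_MAGIC = 0x8BC215C38233BA4B
# Assumption: an offset of all ones marks the absence of a next fragment.
NIL_OFFSET = 0xFFFFFFFFFFFFFFFF


def iter_fragments(reader, offset, size):
    """Yield the raw bytes of each fragment in a file node list, in order."""
    visited = set()
    while offset != NIL_OFFSET and size != 0:
        # Guard against a corrupt file whose fragments point in a circle.
        if offset in visited:
            raise ValueError("Fragment chain loops back on itself.")
        visited.add(offset)
        reader.seek(offset)
        data = reader.read(size)
        if len(data) < 36:
            raise ValueError("Fragment is too short for a header and footer.")
        # The last 20 bytes are the next fragment reference and footer magic.
        next_offset, next_size, footer = struct.unpack_from(
            "<QIQ", data, len(data) - 20)
        if footer != FOOTER_MAGIC:
            raise ValueError("Fragment did not end with the footer magic.")
        yield data
        offset, size = next_offset, next_size
```

Writing it as a generator keeps the chain-following separate from the parsing of the file nodes inside each fragment.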
Reflection
Waiting to set up a new FileNode structure until it was
encountered was not a good approach. It meant writing up the new
class to represent the structure, adding the documentation, then writing the
parsing logic for it before continuing on with the next one. It also meant a lot
of searching about to find the subsection numbers and the file node IDs.
The approach I wish I had taken, once I had the basics working and the first couple
of FileNode structures handled, was to simply go through the entire list of
structures one by one and create the classes for them. This way the members of
the structures would already be present and documented, and the next step would
be to implement the parsing for them.
Update - 2025-06-09
- Handle going to the next file node fragment.
- Completely read the root file node list (tree).
Next
Next up in the implementation would be reading the page names from the sections.