#import "@preview/clean-cnam-template:1.2.0": *
#import "template.typ": *


#show: clean-cnam-template.with(
  title: "mzCBOR format documentation",
  author: "Olivier Langella",
  class: "technical documentation",
  affiliation: "PAPPSO",
  //logo: image("./assets/cnam_logo.svg"),
  start-date: datetime(day: 05, month: 12, year: 2025),
  main-color: navy,
  default-font: "New Computer Modern Math",
  code-font: "Andale Mono",
  outline-code: outline(
    title: "Table of contents",
    depth: 2,
    indent: auto,
  ),
)

#set text(lang: "en")
#set heading(numbering: "1.1")

#show figure.where(kind: "attribute"): set align(start)

#show ref: it => {
    show link: it => {
      show text: body => {
        set text(fill: rgb("#993300"))
        smallcaps(body)
      }
      it
    }
    let el = it.element
    if el.func() == figure  and el.kind == "attribute" {
        link(el.location(),el.supplement)
    } else {
      link(el.location(),el.body)
    }
  }


= Introduction

The CBOR-based mzCBOR format replicates the structure of an mzML-formatted file
in a CBOR binary file.

The Concise Binary Object Representation (CBOR) is a binary data format
designed by the Internet Engineering Task Force (IETF, #link("https://www.rfc-editor.org/rfc/rfc8949.html","RFC 8949")). Its aim
is to increase processing and transfer speeds, albeit at the cost of human
readability.

Using this format as a replacement for mzML is intended not only to save
storage space but also to speed up parsing and processing.

= Why use the CBOR file format?

The current CBOR specification, RFC 8949, is relatively recent (December 2020; it supersedes RFC 7049, published in 2013). The format has several objectives, which are explained in the RFC itself:
#link("https://www.rfc-editor.org/rfc/rfc8949.html","https://www.rfc-editor.org/rfc/rfc8949.html").

Among these objectives, the most important points are:
- Simple data structure, similar to JSON.
- Compact data encoding.
- Frugal CPU usage for both encoding and decoding.
- Suitable for high volumes of data.
- Support for all JSON data types, allowing conversion to and from JSON.
- Extensible and designed for decades of use.

This makes it a format of choice for mass spectrometry data that must be stored reliably for decades while remaining reasonably compact.
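
To give a feel for the compactness, here is a minimal sketch of a CBOR encoder, written for illustration only. It covers just the few data types used in this document, following the major types described in RFC 8949. The function names `cbor_head` and `encode` are hypothetical, and a real project would rely on an existing CBOR library rather than on this code.

```python
import json
import struct


def cbor_head(major, value):
    """Build the CBOR initial byte (major type + argument) for a non-negative value."""
    if value < 24:
        # Small arguments are stored directly in the initial byte.
        return bytes([(major << 5) | value])
    for info, fmt in ((24, ">B"), (25, ">H"), (26, ">I"), (27, ">Q")):
        if value < (1 << (8 * struct.calcsize(fmt))):
            return bytes([(major << 5) | info]) + struct.pack(fmt, value)
    raise ValueError("value does not fit in 64 bits")


def encode(obj):
    """Encode a small subset of Python values to CBOR bytes."""
    if isinstance(obj, bool):  # must be tested before int
        return b"\xf5" if obj else b"\xf4"
    if isinstance(obj, int):
        return cbor_head(0, obj) if obj >= 0 else cbor_head(1, -1 - obj)
    if isinstance(obj, float):
        return b"\xfb" + struct.pack(">d", obj)  # 64-bit float
    if isinstance(obj, bytes):
        return cbor_head(2, len(obj)) + obj  # byte string
    if isinstance(obj, str):
        data = obj.encode("utf-8")
        return cbor_head(3, len(data)) + data  # text string
    if isinstance(obj, list):
        return cbor_head(4, len(obj)) + b"".join(encode(x) for x in obj)
    if isinstance(obj, dict):
        body = b"".join(encode(k) + encode(v) for k, v in obj.items())
        return cbor_head(5, len(obj)) + body
    raise TypeError(f"unsupported type: {type(obj).__name__}")


cbor_bytes = encode({"count": 2})
json_text = json.dumps({"count": 2}, separators=(",", ":"))
print(cbor_bytes.hex())                 # a165636f756e7402
print(len(cbor_bytes), len(json_text))  # 8 11
```

In this small example the integer takes a single byte and the map needs no quotes, colons or braces, which is where the size difference with JSON comes from.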

= Relation between mzCBOR and mzML

mzCBOR is an exact binary representation of an mzML data file.
Each XML element and its attributes are translated to data trees and arrays.

For example:

```xml
<cvList count="2">
    <cv id="MS" fullName="Mass spectrometry ontology" version="4.1.38" URI="https://raw.githubusercontent.com/HUPO-PSI/psi-ms-CV/master/psi-ms.obo" />
    <cv id="UO" fullName="Unit Ontology" version="09:04:2014" URI="https://raw.githubusercontent.com/bio-ontology-research-group/unit-ontology/master/unit.obo" />
</cvList>
```

is translated to:

```json
"cvList": {
    "count": 2,
    "cv": [
        {"id": "MS", "fullName": "Mass spectrometry ontology", "version": "4.1.38", "URI": "https://raw.githubusercontent.com/HUPO-PSI/psi-ms-CV/master/psi-ms.obo"},
        {"id": "UO", "fullName": "Unit Ontology", "version": "09:04:2014", "URI": "https://raw.githubusercontent.com/bio-ontology-research-group/unit-ontology/master/unit.obo"}
    ]
}
```

- Repeated elements are stored in arrays (this is the case for "cv" in this example).
- String values representing integers are stored as 64-bit integers.
- String values representing floats are stored as 64-bit floats.
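
The three rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the converter's actual code: the helper names `convert_value` and `element_to_tree` are hypothetical, element text content is ignored, and a child element that appears only once is assumed to be stored as a single object rather than as an array.

```python
import xml.etree.ElementTree as ET


def convert_value(text):
    """Store integer-like strings as int, float-like strings as float, others as str."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


def element_to_tree(element):
    """Translate one XML element (attributes and children) into a dict."""
    node = {name: convert_value(value) for name, value in element.attrib.items()}
    children = {}
    for child in element:
        children.setdefault(child.tag, []).append(element_to_tree(child))
    for tag, items in children.items():
        # Assumption: only repeated elements become arrays.
        node[tag] = items if len(items) > 1 else items[0]
    return node


xml_text = """<cvList count="2">
    <cv id="MS" version="4.1.38" />
    <cv id="UO" version="09:04:2014" />
</cvList>"""

root = ET.fromstring(xml_text)
tree = {root.tag: element_to_tree(root)}
print(tree)
```

Here "count" becomes the integer 2, while version strings such as "4.1.38" stay as text because they parse neither as an integer nor as a float.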

== Binary data array encoding

Unlike other mzML elements, the "binaryDataArrayList" element has a specific translation.

```xml
<binaryDataArrayList count="2">
    <binaryDataArray encodedLength="6380">
        <cvParam cvRef="MS" accession="MS:1000514" value="" name="m/z array" unitAccession="MS:1000040" unitName="m/z" unitCvRef="MS" />
        <cvParam cvRef="MS" accession="MS:1000523" value="" name="64-bit float" />
        <cvParam cvRef="MS" accession="MS:1000574" value="" name="zlib compression" />
        <binary>eJwt03tcj3f/wPGr...</binary>
    </binaryDataArray>
    <binaryDataArray encodedLength="9092">
        <cvParam cvRef="MS" accession="MS:1000515" value="" name="intensity array" unitAccession="MS:1000131" unitName="number of counts" unitCvRef="MS" />
        <cvParam cvRef="MS" accession="MS:1000523" value="" name="64-bit float" />
        <cvParam cvRef="MS" accession="MS:1000574" value="" name="zlib compression" />
        <binary>eJwtWnlcCGsbnZKK...</binary>
    </binaryDataArray>
</binaryDataArrayList>
```

It is converted to an array:

```json
"binaryDataArray": [
    {
        "bits": 64,
        "isInt": false,
        "compress": "zlib",
        "unit": "MS:1000514",
        "byteArray": 010101010101...,
    },
    {
        "bits": 64,
        "isInt": false,
        "compress": "zlib",
        "unit": "MS:1000515",
        "byteArray": 010101010101...,
    }
]
```

The base64-encoded mzML "binary" element is decoded into a raw byte array ("byteArray"). If the mzML data was compressed, the byte array holds the same compressed bytes.

Storing binary data as a base64 string is expensive in storage space: base64 writes every 3 bytes as 4 characters, so the text is about 33% larger than the same data stored as a binary string.

Any compression method used in mzML is allowed in mzCBOR.
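
The following sketch illustrates this translation with Python's standard library. It is not the converter's code: the sample values are made up, and the dictionary simply mirrors the keys shown in the example above. The array is packed as little-endian 64-bit floats, as mzML does for its binary arrays.

```python
import base64
import struct
import zlib

# Made-up m/z values standing in for a real spectrum.
mz_values = [100.0, 200.5, 300.25]

# What an mzML writer does: pack, compress, then base64-encode to text.
raw = struct.pack("<3d", *mz_values)
compressed = zlib.compress(raw)
mzml_text = base64.b64encode(compressed).decode("ascii")

# What the mzCBOR translation does: undo only the base64 step and keep
# the (still compressed) bytes as a binary string.
byte_array = base64.b64decode(mzml_text)
entry = {
    "bits": 64,
    "isInt": False,
    "compress": "zlib",
    "unit": "MS:1000514",
    "byteArray": byte_array,
}

# A reader recovers the values by decompressing and unpacking.
decoded = list(struct.unpack("<3d", zlib.decompress(entry["byteArray"])))

print(len(mzml_text), "base64 characters versus", len(byte_array), "raw bytes")
```

The base64 text is always 4 characters for every 3 bytes (rounded up), which is the overhead that mzCBOR avoids.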

== mzCBOR header

The first element of the mzCBOR file is an "mzCBOR" element containing information about the conversion software and a universally unique identifier (UUID).
The second element is the "mzML" element, which carries the attributes of the root element of the original mzML file.

```json
{
    "mzCBOR": {
        "mode": 0,
        "informations": {
            "software": "libpappsomspp",
            "version": "0.11.6",
            "type": "mzCBOR",
            "operation": "mzMLconvert",
            "cpu_used": 40,
            "pappsomspp_version": "0.11.6",
            "sysinfo_machine_hostname": "milano",
            "sysinfo_product_name": "Debian GNU/Linux 13 (trixie)",
            "timestamp": "2025-12-02T14:07:13",
            "uuid": "{2292d028-fbb7-485a-a82a-f5a03abc0267}"
        }
    },
    "mzML": {
        "xmlns": "http://psi.hupo.org/ms/mzml",
        "schemaLocation": "http://psi.hupo.org/ms/mzml http://psidev.info/files/ms/mzML/xsd/mzML1.1.0.xsd",
        "version": "1.1.0",
        "id": "20120906_raw_extract_1_A01_urnb-1"
    },
    "cvList":..
}
```
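
As an illustration, a reader could check this header once the file has been decoded into nested dictionaries. The sketch below assumes that decoding has already been done by some CBOR library and that key order is preserved. The function name `check_header` and the returned summary are hypothetical.

```python
import uuid
from datetime import datetime


def check_header(document):
    """Verify the header order and parse a few header fields."""
    keys = list(document)
    if keys[:2] != ["mzCBOR", "mzML"]:
        raise ValueError("expected 'mzCBOR' then 'mzML' as the first two elements")
    info = document["mzCBOR"]["informations"]
    return {
        "software": info["software"],
        # uuid.UUID accepts the identifier with its surrounding braces.
        "uuid": uuid.UUID(info["uuid"]),
        "timestamp": datetime.fromisoformat(info["timestamp"]),
        "mzml_version": document["mzML"]["version"],
    }


# Decoded header, reduced to the fields used here.
document = {
    "mzCBOR": {
        "mode": 0,
        "informations": {
            "software": "libpappsomspp",
            "timestamp": "2025-12-02T14:07:13",
            "uuid": "{2292d028-fbb7-485a-a82a-f5a03abc0267}",
        },
    },
    "mzML": {"version": "1.1.0", "id": "20120906_raw_extract_1_A01_urnb-1"},
}

summary = check_header(document)
print(summary["software"], summary["uuid"], summary["timestamp"].isoformat())
```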

= mzML to mzCBOR conversion tool



```bash
mzml2mzcbor -i LFQ_Astral_DDA_15min_50ng_Condition_A_REP1.mzML -o LFQ_Astral_DDA_15min_50ng_Condition_A_REP1.mzcbor
```

In this example, the original file "LFQ_Astral_DDA_15min_50ng_Condition_A_REP1.mzML" is 1.9 GB and the converted file "LFQ_Astral_DDA_15min_50ng_Condition_A_REP1.mzcbor" is 1.3 GB, for exactly the same data.


= mzCBOR to mzML conversion tool



```bash
mzcbor2mzml -i LFQ_Astral_DDA_15min_50ng_Condition_A_REP1.mzcbor -o LFQ_Astral_DDA_15min_50ng_Condition_A_REP1.mzML
```

The resulting mzML file is an exact image of the original mzML file, with automatic XML indentation and real UTF-8 encoding (which is not reliably guaranteed in the XML output of bioinformatics tools).


= Storage efficiency

The mzCBOR file is about 66% of the size of the corresponding mzML file.

= CPU efficiency

#table(
  columns: (auto, auto, auto),
  inset: 10pt,
  align: horizon,
  table.header(
    [*Operation*], [*mzML*], [*mzCBOR*],
  ),
  [TIC chromatogram],[25.8692 s],[18.3258 s],
  [Retention timeline],[8.02264 s], [7.16747 s],
  [Random access], [24.6577 ms],[786.731 #sym.mu;s],
)
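
The speed-up factors implied by this table can be worked out with a short calculation. The sketch below only restates the measured times from the table in seconds. The variable names are illustrative.

```python
# (mzML time, mzCBOR time) in seconds, copied from the table above.
timings = {
    "TIC chromatogram": (25.8692, 18.3258),
    "Retention timeline": (8.02264, 7.16747),
    "Random access": (24.6577e-3, 786.731e-6),
}

# Speed-up factor = mzML time divided by mzCBOR time.
speedups = {name: mzml / mzcbor for name, (mzml, mzcbor) in timings.items()}

for name, factor in speedups.items():
    print(f"{name}: mzCBOR is {factor:.1f}x faster")
```

With these figures, mzCBOR is roughly 1.4 times faster for the TIC chromatogram, 1.1 times faster for the retention timeline, and about 31 times faster for random access.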
