ISCC - Specification v1.x#

Last revised: 2022-10-10 16:43:08

Warning

This document is an out-of-date early draft, retained for historical reasons only. Please follow current development at https://ieps.iscc.codes

Abstract#

The International Standard Content Code (ISCC) is an open and decentralized digital media identifier. An ISCC can be created from digital content and its basic metadata by anybody who follows the procedures of the ISCC specification or who uses open source software that supports ISCC creation conforming to the ISCC specification.

Note to Readers#

For public discussion of issues for this specification, please use the GitHub issue tracker: https://github.com/iscc/iscc-specs/issues.

If you want to chat with developers, join us on Telegram at https://t.me/iscc_dev.

You can find the latest version of this specification at http://iscc.codes/specification/.

Public review, discussion and contributions are welcome.

About this Document#

Document Version

While there is already a Version 1.0 spec, we are still expecting backward incompatible changes until Version 2.0. Parts of this specification may become stable earlier. We will document this during minor releases. We encourage partners to follow development and test, implement, and give feedback based on the latest (this) version of the ISCC Specification.

This document proposes an open and vendor neutral ISCC standard and describes the technical procedures to create and manage ISCC identifiers. The first version of this document resulted from a prototyping project by the Content Blockchain Project and received funding from the Google Digital News Initiative (DNI). The content of this document results from a voluntary effort of the authors with an open and public consensus process.

Conventions and Terminology#

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC 2119].

Definitions#

Basic Metadata:: Minimal set of ISCC-specific top-level metadata that applications SHOULD support.
Bound Metadata:: Metadata that is used during generation of the ISCC. A change to bound metadata may therefore change the derived ISCC.
Extended Metadata:: Industry- and application-specific metadata attached to an ISCC. It is not encoded within the ISCC Meta-Code but may be supplied together with the ISCC.
Character:: Throughout this specification, a character means one Unicode code point. Because of the structure of Unicode, a character is therefore not necessarily a full glyph; it might be a combining accent or similar.
Digital Media Object:: A blob of raw bytes with some media type specific encoding.
Generic Media Type:: A basic content type such as plain text in a normalized and generic (UTF-8) encoding format.
ISCC:: International Standard Content Code
ISCC Code:: The printable text encoded representation of an ISCC
ISCC Digest:: The raw binary data of an ISCC

Introduction#

An ISCC permanently identifies content at multiple levels of granularity. It is algorithmically generated from basic metadata and the contents of a digital media object. It is designed to be registered and stored on a public and decentralized blockchain. An ISCC for a media object can be created and registered by the content author, a publisher, a service provider or anybody else. By themselves, the ISCC and its basic registration on a blockchain do not make any statement or claim about authorship or ownership of the identified content.

ISCC Structure#

A Fully Qualified ISCC Digest is a fixed size sequence of 36 bytes (288 bits) assembled from multiple sub-components. The Fully Qualified ISCC Code is a 52 character encoded printable string representation of a complete ISCC Digest. This is a high-level overview of the ISCC creation process:

[Figure: ISCC creation process]

ISCC Components#

The ISCC Digest is built from multiple self-describing 72-bit components:

Components:	Meta-Code	Content-Code	Data-Code	Instance-Code
Context:	Intangible creation	Content similarity	Data similarity	Data checksum
Input:	Metadata	Extracted content	Raw data	Raw data
Algorithms:	Similarity Hash	Type specific	CDC, Minimum Hash	Hash Tree
Size:	72 bits	72 bits	72 bits	72 bits

ISCC components MAY be used separately or in combination by applications for various purposes. Individual components MUST be presented as 13-character base58-iscc encoded strings to end users and MAY be prefixed with their component name.

Single component ISCC-Code (13 characters)

Meta-Code: CCDFPFc87MhdT

Combinations of components MUST include the Meta-Code component and MUST be ordered as Meta-Code, Content-Code, Data-Code, and Instance-Code. Individual components MAY be skipped and SHOULD be separated with hyphens. A combination of components SHOULD be prefixed with "ISCC".

Combination of ISCC-Code components

ISCC: CCPktvj3dVoVa-CTPCWTpGPMaLZ-CDL6QsUZdZzog

A Fully Qualified ISCC Code is an ordered sequence of Meta-Code, Content-Code, Data-Code, and Instance-Code codes. It SHOULD be prefixed with ISCC and MAY be separated by hyphens.

Fully Qualified ISCC-Code (52 characters)

ISCC: CCDFPFc87MhdTCTWAGYJ9HZGj1CDhydSjutScgECR4GZ8SW5a7uc

Fully Qualified ISCC-Code with hyphens (55 characters)

ISCC: CCDFPFc87MhdT-CTWAGYJ9HZGj1-CDhydSjutScgE-CR4GZ8SW5a7uc
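Because every component code is exactly 13 characters long, a combined code can be split into its components without relying on hyphens. The following is a minimal sketch of that idea; the helper name `split_iscc` is hypothetical and not part of this specification.

```python
from typing import List

COMPONENT_LENGTH = 13  # characters per Base58-ISCC encoded component


def split_iscc(code: str) -> List[str]:
    """Split an ISCC code into its 13-character component codes.

    Hypothetical helper for illustration. Accepts an optional "ISCC:" prefix
    and optional hyphens between components.
    """
    body = code.strip()
    if body.upper().startswith("ISCC:"):
        body = body[len("ISCC:"):]
    body = body.replace("-", "").strip()
    if not body or len(body) % COMPONENT_LENGTH != 0:
        raise ValueError("length must be a non-zero multiple of 13 characters")
    return [
        body[i:i + COMPONENT_LENGTH]
        for i in range(0, len(body), COMPONENT_LENGTH)
    ]


components = split_iscc("ISCC: CCDFPFc87MhdT-CTWAGYJ9HZGj1-CDhydSjutScgE-CR4GZ8SW5a7uc")
print(components)
# ['CCDFPFc87MhdT', 'CTWAGYJ9HZGj1', 'CDhydSjutScgE', 'CR4GZ8SW5a7uc']
```

The same call works for the 52-character form without hyphens and for partial combinations such as a Meta-Code followed by a Content-Code.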

Component Types#

Each component has the same basic structure of a 1-byte header and an 8-byte body section.

The 1-byte header of each component is subdivided into 2 nibbles (4 bits each). The first nibble specifies the component type while the second nibble is component specific.

The header only needs to be carried in the encoded representation. As similarity searches across different components are of little use, the type-information contained in the header of each component can be safely ignored after an ISCC has been decomposed and internally typed by an application.

List of Component Headers#

Component	Nibble-1	Nibble-2	Byte	Code
Meta-Code	0000	0000 - Reserved	0x00	CC
Content-Code-Text	0001	0000 - Content Type Text	0x10	CT
Content-Code-Text PCF	0001	0001 - Content Type Text + PCF	0x11	Ct
Content-Code-Image	0001	0010 - Content Type Image	0x12	CY
Content-Code-Image PCF	0001	0011 - Content Type Image + PCF	0x13	Ci
Content-Code-Audio	0001	0100 - Content Type Audio	0x14	CA
Content-Code-Audio PCF	0001	0101 - Content Type Audio + PCF	0x15	Ca
Content-Code-Video	0001	0110 - Content Type Video	0x16	CV
Content-Code-Video PCF	0001	0111 - Content Type Video + PCF	0x17	Cv
Content-Code-Mixed	0001	1000 - Content Type Mixed	0x18	CM
Content-Code-Mixed PCF	0001	1001 - Content Type Mixed + PCF	0x19	Cm
Data-Code	0010	0000 - Reserved	0x20	CD
Instance-Code	0011	0000 - Reserved	0x30	CR
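The header layout in the table above can be unpacked with two bit operations. The sketch below is illustrative only: the names `Header` and `parse_header` are hypothetical, and the split of the second nibble into three media type bits plus one Partial Content Flag bit follows the Content-Code sections of this document.

```python
from typing import NamedTuple, Optional

COMPONENT_TYPES = {
    0b0000: "Meta-Code",
    0b0001: "Content-Code",
    0b0010: "Data-Code",
    0b0011: "Instance-Code",
}
GENERIC_MEDIA_TYPES = {
    0b000: "text",
    0b001: "image",
    0b010: "audio",
    0b011: "video",
    0b100: "mixed",
}


class Header(NamedTuple):
    component: str
    media_type: Optional[str]  # only set for Content-Codes
    partial: bool              # Partial Content Flag, only meaningful for Content-Codes


def parse_header(header_byte: int) -> Header:
    """Interpret a component header byte (hypothetical helper)."""
    first_nibble = header_byte >> 4
    second_nibble = header_byte & 0x0F
    component = COMPONENT_TYPES[first_nibble]  # KeyError for unknown types
    if component != "Content-Code":
        return Header(component, None, False)
    media_type = GENERIC_MEDIA_TYPES[second_nibble >> 1]  # upper 3 bits
    partial = bool(second_nibble & 0b0001)                # last bit
    return Header(component, media_type, partial)


print(parse_header(0x13))
# Header(component='Content-Code', media_type='image', partial=True)
```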

The body section of each component is specific to the component and is always 8 bytes long, so it fits into a 64-bit integer for efficient data processing. The following sections give an overview of how the different components work and how they are generated.

Meta-Code Component#

The Meta-Code component starts with a 1-byte header 00000000. The first nibble 0000 indicates that this is a Meta-Code component type. The second nibble is reserved for future extended features of the Meta-Code.

The Meta-Code body is built from a 64-bit similarity_hash over 4-character n-grams of the basic metadata of the content to be identified. The basic metadata supplied to the Meta-Code generating function is assumed to be UTF-8 encoded. Errors that occur while decoding such a byte string input to native Unicode text MUST terminate the process and MUST NOT be silenced. An ISCC generating application MUST provide a meta_id function that accepts minimal and generic metadata and returns a Base58-ISCC encoded Meta-Code component and trimmed metadata.

Inputs to Meta-Code function#

Name	Type	Required	Description
title	text	Yes	The title of an intangible creation.
extra	text	No	An optional short statement that distinguishes this intangible creation from another one for forced Meta-Code uniqueness. (default: empty string)

Note

The basic metadata inputs are intentionally simple and generic. We abstain from more specific metadata for Meta-Code generation in favor of compatibility across industries. To support global clustering, it is RECOMMENDED to only supply the title field for Meta-Code generation. Imagine a creator input-field for metadata. Who would you list as the creators of a movie? The directors, writers, the main actors? Would you list all of them, and if not, how would you decide whom to list? Global disambiguation of similar title data can be accomplished with the extra-field. Industry- and application-specific metadata requirements can be met by extended metadata.

Generate Meta-Code#

An ISCC generating application MUST follow these steps in the given order to produce a stable Meta-Code:

1. Apply text_normalize separately to the title and extra inputs while keeping white space.
2. Apply text_trim to the results of the previous step. The results of this step MUST be supplied as basic metadata for ISCC registration.
3. Concatenate the trimmed title and extra using a space (\u0020) as a separator. Remove leading/trailing whitespace.
4. Create a list of 4-character n-grams by sliding character-wise through the result of the previous step.
5. Encode each n-gram from the previous step to a UTF-8 bytestring and calculate its xxHash64 digest.
6. Apply similarity_hash to the list of digests from the previous step.
7. Prepend the 1-byte component header (0x00) to the results of the previous step.
8. Encode the resulting 9-byte sequence with encode.
9. Return the encoded Meta-Code, the trimmed title and the trimmed extra data.
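The concatenation and n-gram steps above can be sketched as follows. The function name `sliding_ngrams` and the sample values are hypothetical, and the handling of inputs shorter than four characters is an assumption, since the steps do not cover that case. The xxHash64 step is not shown because it needs a third-party library.

```python
from typing import List


def sliding_ngrams(text: str, width: int = 4) -> List[str]:
    """Slide character-wise over text and return n-grams of the given width."""
    # Assumption: inputs shorter than `width` yield the whole text as one n-gram.
    count = max(len(text) - width + 1, 1)
    return [text[i:i + width] for i in range(count)]


# Hypothetical title and extra values, already normalized and trimmed.
title, extra = "the times", "03 jan 2009"
joined = f"{title} {extra}".strip()

ngrams = sliding_ngrams(joined)
encoded = [ngram.encode("utf-8") for ngram in ngrams]  # inputs for xxHash64

print(ngrams[:3])
# ['the ', 'he t', 'e ti']
```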

Dealing with Meta-Code collisions#

Ideally we want multiple ISCCs that identify different manifestations of the same intangible creation to be automatically grouped by an identical leading Meta-Code component. We call such a natural grouping an intended component collision. Metadata, captured and edited by humans, is notoriously unreliable. By using normalization and a similarity hash on the metadata, we account for some of this variation while keeping the Meta-Code component somewhat stable.

Auto-generated Meta-Code components are expected to miss some intended collisions. An application SHOULD check for such missed intended component collisions before registering a new Meta-Code with the canonical registry of ISCCs by conducting a similarity search and asking for user feedback.

But what about unintended component collisions? Such collisions might happen because two different intangible creations have very similar or even identical metadata. But they might also happen by chance. With 2^64 possible Meta-Code component bodies (the body is 64 bits long), the probability of random collisions rises in an S-curved shape with the number of deployed ISCCs (see: Hash Collision Probabilities). We should keep in mind that the Meta-Code component is only one part of a fully qualified ISCC Code. Unintended collisions of the Meta-Code component are generally deemed acceptable and expected.
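The S-shaped growth of chance collisions can be estimated with the usual birthday-bound approximation. The sketch below assumes uniformly random component bodies, which real metadata does not guarantee, so it only covers collisions that happen by chance. The function name is hypothetical, and the example uses the 64-bit body size given in the Meta-Code Component section.

```python
import math


def collision_probability(n_codes: int, body_bits: int) -> float:
    """Approximate chance that at least two of n_codes random bodies collide.

    Birthday-bound approximation: p = 1 - exp(-n * (n - 1) / (2 * 2**bits)).
    """
    space = 2 ** body_bits
    return -math.expm1(-n_codes * (n_codes - 1) / (2 * space))


for n in (10**6, 10**8, 10**9, 2**32):
    print(f"{n:>12} codes: {collision_probability(n, body_bits=64):.3e}")
```

Under these assumptions the probability stays negligible for millions of codes and approaches roughly 39% at 2^32 codes, which is where the curve starts to steepen.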

If for any reason an application wants to avoid unintended collisions with pre-existing Meta-Code components, it may use the extra-field. An application MUST first generate a Meta-Code without asking the user for input to the extra-field and then check for collisions with the canonical registry of ISCCs. If it finds a collision with a pre-existing Meta-Code, it may display the metadata of the colliding entry and interact with the user to determine whether it is an unintended collision. Only if the user indicates an unintended collision may the application ask for a disambiguation, which is then added as an amendment to the metadata via the extra-field to create a different Meta-Code component. The application may repeat the pre-existence check until it finds no collision or a user-intended collision. The application MUST NOT supply auto-generated input to the extra-field.

It is our opinion that the concept of intended collisions of Meta-Code components is a useful concept and a net positive. But one must know that this characteristic also has its pitfalls. It is not an attempt to provide an unambiguous - agreed upon - definition of "identical intangible creations".

Content-Code Component#

The Content-Code component has multiple subtypes. The subtypes correspond with the Generic Media Types (GMT). A fully qualified ISCC can only have one Content-Code component of one specific GMT, but there may be multiple ISCCs with different Content-Code types per digital media object.

A Content-Code is generated in two broad steps. In the first step, we extract and convert content from a rich media type to a normalized GMT. In the second step, we use a GMT-specific process to generate the Content-Code component of an ISCC.

Generic Media Types#

The Content-Code type is signaled by the first 3 bits of the second nibble of the first byte of the Content-Code:

Content-Code Type	Nibble 2 Bits 0-2	Description
text	000	Generated from extracted and normalized plain-text
image	001	Generated from normalized grayscale pixel data
audio	010	To be defined in a later version of the specification
video	011	To be defined in a later version of the specification
mixed	100	Generated from multiple Content-Codes
	101, 110, 111	Reserved for future versions of specification

Content-Code-Text#

The Content-Code-Text is built from the extracted plain-text content of an encoded media object. To build a stable Content-Code-Text the plain-text content must first be extracted from the digital media object. It should be extracted in a way that is reproducible. There are many text document formats out in the wild and extracting plain-text from all of them is anything but a trivial task. While text-extraction is out of scope for this specification, it is RECOMMENDED that plain-text content be extracted with the open-source Apache Tika v1.23 toolkit if generic reproducibility of the Content-Code-Text component is desired.

An ISCC generating application MUST provide a content_id(text, partial=False) function that accepts UTF-8 encoded plain text and a boolean, indicating the partial content flag as input and returns a Content-Code with GMT type text. The procedure to create a Content-Code-Text is:

1. Apply text_normalize to the text input while removing white-space.
2. Create character-wise n-grams of length 13 from the normalized text.
3. Create a list of 32-bit unsigned integer features by applying xxHash32 to the results of the previous step.
4. Apply minimum_hash to the list of features from the previous step with n=64.
5. Collect the least significant bits from the 64 MinHash features from the previous step.
6. Create a 64-bit digest from the collected bits.
7. Prepend the 1-byte component header (0x10 full content or 0x11 partial content).
8. Encode the resulting 9-byte sequence with encode and return the result.
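The bit collection and header steps of the procedure above can be sketched as follows. The helper name `lsb_digest` and the sample features are hypothetical. The bit order, with the first feature supplying the most significant bit, is an assumption, as this section does not state it.

```python
from typing import Sequence


def lsb_digest(minhash_features: Sequence[int]) -> bytes:
    """Pack the least significant bit of 64 features into an 8-byte digest."""
    if len(minhash_features) != 64:
        raise ValueError("expected exactly 64 MinHash features")
    value = 0
    for feature in minhash_features:
        # Assumption: the first feature supplies the most significant bit.
        value = (value << 1) | (feature & 1)
    return value.to_bytes(8, "big")


header = bytes([0x10])   # Content-Code-Text, full content
features = [3, 8] * 32   # hypothetical MinHash output: odd, even, odd, ...
component_digest = header + lsb_digest(features)

print(component_digest.hex())
# 10aaaaaaaaaaaaaaaa
```

The resulting 9-byte sequence is what the final step passes to encode.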

Content-Code-Image#

For the Content-Code-Image we are opting for a DCT-based perceptual image-hash instead of a more sophisticated key-point detection based method. In view of the generic deployability of the ISCC we chose an algorithm that has moderate computation requirements and is easy to implement while still being robust against common image manipulations.

An ISCC generating application MUST provide a content_id_image(image, partial=False) function that accepts a local file path to an image and returns a Content-Code with GMT type image. The procedure to create a Content-Code-Image is as follows:

1. Apply image_normalize to receive a two-dimensional array of gray-scale pixel data.
2. Apply image_hash to the results of the previous step.
3. Prepend the 1-byte component header (0x12 full content or 0x13 partial content) to the results of the previous step.
4. Encode the resulting 9-byte sequence with encode and return the result.

Image Data Input

The content_id_image function may optionally accept the raw byte data of an encoded image or an internal native image object as input for convenience.

JPEG Decoding

Decoding of JPEG images is not deterministic across implementations. Different image processing libraries may yield diverging pixel data and therefore different Content-Code-Image components. The reference implementation uses the built-in decoder of the Python Pillow imaging library. Future versions of the ISCC specification may define a custom deterministic JPEG decoding procedure.

Content-Code-Mixed#

The Content-Code-Mixed aggregates multiple Content-Codes of the same or different types. It may be used for digital media objects that embed multiple types of media or for collections of contents of the same type. First, collect the contents from the mixed media object or content collection and generate a Content-Code for each item. An ISCC conforming application MUST provide a content_id_mixed function that takes a list of Content-Codes as input and returns a Content-Code-Mixed. Follow these steps to create a Content-Code-Mixed:

Signature: content_id_mixed(cids: List[str], partial: bool=False) -> str

1. Decode the list of Content-Codes.
2. Extract the first 8 bytes from each digest (note: this includes the header part of the Content-Codes).
3. Apply similarity_hash to the list of digests from step 2.
4. Prepend the 1-byte component header (0x18 full content or 0x19 partial content).
5. Apply encode to the result of step 4 and return the result.

Partial Content Flag (PCF)#

The last bit of the header byte of the Content-Code is the "Partial Content Flag". It designates if the Content-Code applies to the full content, or just some part of it. The PCF MUST be set as a 0-bit (full GMT-specific content) by default. Setting the PCF to 1 enables applications to create multiple linked ISCCs of partial extracts of a content collection. The exact semantics of partial content are outside of the scope of this specification. Applications that plan to support partial Content-Codes MUST define their semantics.

Partial Content Flag

PCF Linking Example

Let's assume we have a single newspaper issue "The Times - 03 Jan 2009". You would generate one Meta-Code component with the title "The Times" and extra "03 Jan 2009". The resulting Meta-Code component will be the grouping prefix in this scenario.

We use a Content-Code-Mixed with PCF 0 (not partial) for the ISCC of the newspaper issue. We generate Data-Code and Instance-Code from the print PDF of the newspaper issue.

To create an ISCC for a single extracted image that should convey context with the newspaper issue, we reuse the Meta-Code of the newspaper issue and create a Content-Code-Image with PCF 1 (partial to the newspaper issue). For the Data-Code or Instance-Code of the image, we are free to choose if we reuse those of the newspaper issue or create separate ones. The former would express strong specialization of the image to the newspaper issue (not likely to be useful out of context). The latter would create a stronger link to an eventual standalone ISCC of the image. Note that the ISCC of the individual image keeps links in both ways:

Image is linked to the newspaper issue by identical Meta-Code component
Image is linked to the standalone version of the image by identical Content-Code-Image body

This is just one example that illustrates the flexibility that the PCF-Flag provides in concert with a grouping Meta-Code. With great flexibility comes great danger of complexity. Applications SHOULD do careful planning before using the PCF-Flag with internally defined semantics.

Data-Code Component#

For the Data-Code, which encodes data similarity, we use a content-defined chunking algorithm that provides some shift resistance and calculate the MinHash from those chunks. To accommodate small files, the first 100 chunks have a ~140-byte size target while the remaining chunks target ~6 kB in size.

The Data-Code is built from the raw encoded data of the content to be identified. An ISCC generating application MUST provide a data_id function that accepts the raw encoded data as input.

Generate Data-Code#

1. Apply data_chunks to the raw encoded content data.
2. For each chunk, calculate the xxHash32 integer hash.
3. Apply minimum_hash to the resulting list of 32-bit unsigned integers with n=64.
4. Collect the least significant bits from the 64 MinHash features.
5. Create a 64-bit digest from the collected bits.
6. Prepend the 1-byte component header (0x20).
7. Apply encode to the result of step 6 and return the result.

Instance-Code Component#

The Instance-Code is built from the raw data of the media object to be identified and serves as checksum for the media object. The raw data of the media object is split into 64-kB data-chunks. Then we build a hash-tree from those chunks and use the truncated tophash (Merkle root) as component-body of the Instance-Code.

To guard against length-extension attacks and second preimage attacks, we use double sha256 for hashing. We also prefix the hash input data with a 0x00-byte for the leaf node hashes and with a 0x01-byte for the internal node hashes. While the Instance-Code itself is a non-cryptographic checksum, the full tophash may be supplied in the extended metadata of an ISCC if secure integrity verification is required.

[Figure: Instance-Code creation]

An ISCC generating application MUST provide an instance_id function that accepts the raw data file as input and returns an encoded Instance-Code and a full hex-encoded 256-bit tophash.

Generate Instance-Code#

1. Split the raw bytes of the encoded media object into 64-kB chunks.
2. For each chunk, calculate the sha256d of the concatenation of a 0x00-byte and the chunk bytes. We call the resulting values leaf node hashes (LNH).
3. Calculate the next level of the hash tree by applying sha256d to the concatenation of a 0x01-byte and adjacent pairs of LNH values. If the number of LNH values is odd, concatenate the last LNH value with itself. We call the resulting values internal node hashes (INH).
4. Recursively apply 0x01-prefixed pair-wise hashing to the results of the last step until the process yields only one hash value. We call this value the tophash.
5. Trim the resulting tophash to the first 8 bytes.
6. Prepend the 1-byte component header (0x30).
7. Encode the resulting 9-byte sequence with encode to an Instance-Code.
8. Hex-encode the tophash.
9. Return the Instance-Code and the hex-encoded tophash.
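The hash tree described above can be sketched with the standard library alone. This is an illustration, not the reference implementation: the chunk size of 64,000 bytes is one reading of "64-kB", treating empty input as a single empty chunk is an assumption, and the function names are hypothetical. The final Base58-ISCC encoding step is left out.

```python
import hashlib
from typing import Tuple

CHUNK_SIZE = 64_000  # assumption: "64-kB" read as 64,000 bytes


def sha256d(data: bytes) -> bytes:
    """Double SHA-256."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def top_hash(data: bytes, chunk_size: int = CHUNK_SIZE) -> bytes:
    """Return the 32-byte tophash (Merkle root) of data."""
    # Assumption: empty input is hashed as one empty chunk.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [b""]
    level = [sha256d(b"\x00" + chunk) for chunk in chunks]  # leaf node hashes
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # pair the odd last hash with itself
        level = [
            sha256d(b"\x01" + level[i] + level[i + 1])  # internal node hashes
            for i in range(0, len(level), 2)
        ]
    return level[0]


def instance_digest(data: bytes, chunk_size: int = CHUNK_SIZE) -> Tuple[bytes, str]:
    """Return the 9-byte Instance-Code digest and the hex-encoded tophash."""
    tophash = top_hash(data, chunk_size)
    return bytes([0x30]) + tophash[:8], tophash.hex()


digest, hex_tophash = instance_digest(b"hello world")
print(len(digest), len(hex_tophash))
# 9 64
```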

ISCC Metadata#

As a generic content identifier, the ISCC makes minimal assumptions about metadata that must or should be supplied together with an ISCC. The RECOMMENDED data-interchange format for ISCC metadata is JSON. We distinguish between Basic Metadata and Extended Metadata:

Basic Metadata#

Basic metadata for an ISCC is metadata that is explicitly defined by this specification. The following table enumerates basic metadata fields for the top-level of the JSON metadata object:

Name	Type	Required	Bound	Description
version	integer	No	No	Version of ISCC Specification. Assumed to be 1 if omitted.
title	text	Yes	Yes	The title of an intangible creation identified by the ISCC. The normalized and trimmed UTF-8 encoded text MUST NOT exceed 128 bytes. The result of processing `title` and `extra` data with the `meta_id` function MUST match the Meta-Code component of the ISCC.
extra	text	No	Yes	An optional short statement that distinguishes this intangible creation from another one for Meta-Code uniqueness.
tophash	text (hex)	No	No	The full hex-encoded tophash (Merkle root) returned by the `instance_id` function.
meta	array	No	No	A list of one or more extended metadata entries. MUST include at least one entry if specified.

Attention

Bound metadata impacts the ISCC Code (Meta-Code) and cannot be changed afterwards. Depending on adoption and real world use, future versions of this specification may define new basic metadata fields. Applications MAY add custom fields at the top level of the JSON object, but MUST prefix those fields with an underscore to avoid collisions with future extensions of this specification.

Extended Metadata#

Extended metadata for an ISCC is metadata that is not explicitly defined by this specification. All such metadata SHOULD be supplied as JSON objects within the top-level meta-array field. This allows for a flexible and extendable way to supply additional industry specific metadata about the identified content.

Extended metadata entries MUST be wrapped in a JSON object of the following structure:

Name	Description
schema	The `schema`-field may indicate a well-known metadata schema (such as Dublin Core, IPTC, ID3v2, ONIX) that is used. RECOMMENDED `schema`: "schema.org"
mediatype	The `mediatype`-field specifies an IANA Media Type. RECOMMENDED `mediatype`: "application/ld+json"
url	A URL that is expected to host the metadata with the indicated `schema` and `mediatype`. This field is only required if the `data`-field is omitted.
data	The `data`-field holds the metadata conforming to the indicated `schema` and `mediatype`. It is only required if the `url` field is omitted.
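To make the two metadata tables concrete, the sketch below builds a metadata object and checks a few of the stated rules. All values are made up for illustration, and `check_metadata` is a hypothetical helper that covers only part of the requirements. It checks the title as supplied and does not run text_normalize or text_trim.

```python
import json
from typing import List

KNOWN_FIELDS = {"version", "title", "extra", "tophash", "meta"}


def check_metadata(metadata: dict) -> List[str]:
    """Return a list of problems; an empty list means none were detected."""
    problems = []
    title = metadata.get("title")
    if not isinstance(title, str) or not title:
        problems.append("title is required")
    elif len(title.encode("utf-8")) > 128:
        problems.append("title exceeds 128 bytes")
    for name in metadata:
        if name not in KNOWN_FIELDS and not name.startswith("_"):
            problems.append(f"custom field {name!r} must start with an underscore")
    if "meta" in metadata:
        entries = metadata["meta"]
        if not isinstance(entries, list) or not entries:
            problems.append("meta must be a non-empty array")
        else:
            for index, entry in enumerate(entries):
                if "url" not in entry and "data" not in entry:
                    problems.append(f"meta[{index}] needs a url or data field")
    return problems


# Hypothetical example values.
example = {
    "version": 1,
    "title": "the times",
    "extra": "03 jan 2009",
    "meta": [
        {
            "schema": "schema.org",
            "mediatype": "application/ld+json",
            "data": {"@type": "Newspaper", "name": "The Times"},
        }
    ],
    "_internal_id": 42,  # custom top-level field, underscore-prefixed
}

print(json.dumps(example, indent=2))
print(check_metadata(example))
# []
```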

ISCC Registration#

The ISCC is a decentralized identifier. ISCCs can be generated for content by anybody who has access to the content. Because of the clustering properties of its components, the ISCC provides utility in data interchange and de-duplication scenarios even without a global registry. There is no central authority for the registration of ISCC codes or certification of content authorship.

As an open system, the ISCC allows any person or organization to offer ISCC registration services as they see fit and without the need to ask anyone for permission. This also presumes that no person or organization may claim exclusive authority over ISCC registration.

Blockchain Registry#

To properly address the questions of identifier uniqueness, ownership and authentication within the ISCC Standard, the assignment of a set of canonical blockchains is a requirement. The distributed nature of blockchains is a perfect fit for long-term persistent identifier registration and resolver services.

The assignment of a set of canonical blockchains is NOT YET part of this specification.

Because this decision is of such vital importance, we suggest waiting for further feedback and additional community involvement before we address these questions either in an updated version of this specification or in a separate specification.

Our recommendation to the community is to agree on a set of decentralized, open, and public blockchains that have specific support for registry-services. This would maximize the value for all participants in the ecosystem. Governance and protocol related questions are being worked on by many projects.

ISCC Embedding#

Embedding ISCC codes into content is only RECOMMENDED if it does not create a side effect. We call it a side effect if embedding an ISCC code changes the content to such an extent that it yields a different ISCC code.

Side effects will depend on the combination of ISCC components that are to be embedded. A Meta-Code can always be embedded without side effects because it does not depend on the content itself. Content-Code and Data-Code may not change if embedded in larger media objects. Instance-Codes cannot easily be embedded as they will inevitably have a side effect on the post-embedding Instance-Code without special processing.

Applications MAY embed ISCC codes that have side effects if they specify a procedure by which the embedded ISCC codes can be stripped in such a way that the stripped content will yield the original embedded ISCC codes.

ISCC Embedding

We can embed the following combination of components from the markdown version of this document into the document itself because adding or removing them has no side effect:

ISCC: CCDbMYw6NfC8a-CTtW9UFozcmBJ-CDYJsRdBNAERM

ISCC URI Scheme#

Provisional Section

The ISCC URI Scheme and link-resolver details ultimately depend on identifier registration, ownership, uniqueness and governance related decisions which are not yet part of this specification. See also: Blockchain Registry.

The purpose of the ISCC URI scheme based on RFC 3986 is to enable users to discover information like metadata or license offerings from an ISCC marked content by clicking a link on a webpage or by scanning a QR-Code.

The scheme name is iscc. The path component MUST be a fully qualified ISCC Code without hyphens. An optional stream query key MAY indicate the blockchain stream information source. If the stream query key is omitted, applications SHOULD return information from the open ISCC Stream.

The scheme name component ("iscc:") is case-insensitive. Applications MUST accept any combination of uppercase and lowercase letters in the scheme name. All other URI components are case-sensitive.

Applications MAY register themselves as a handler for the "iscc:" URI scheme if no other handler is already registered. If another handler is already registered, an application MAY ask the user to change it on the first run of the application.

URI Syntax#

<foo> means placeholder, [bar] means optional.

iscc:<fq-iscc-code>[?stream=<name>]

URI Example#

iscc:11TcMGvUSzqoM1CqVA3ykFawyh1R1sH4Bz8A1of1d2Ju4VjWt26S?stream=smart-license
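A URI of this form can be taken apart with the Python standard library. The sketch below is illustrative: the names `IsccUri` and `parse_iscc_uri` are hypothetical, and returning `None` for a missing stream key is just one way to signal that the open ISCC Stream should be used.

```python
from typing import NamedTuple, Optional
from urllib.parse import parse_qs, urlsplit


class IsccUri(NamedTuple):
    code: str               # fully qualified ISCC Code without hyphens
    stream: Optional[str]   # None means: fall back to the open ISCC Stream


def parse_iscc_uri(uri: str) -> IsccUri:
    """Parse an iscc: URI into its code and optional stream name."""
    parts = urlsplit(uri)
    if parts.scheme != "iscc":  # urlsplit lowercases the scheme
        raise ValueError("not an iscc: URI")
    code = parts.path
    if len(code) != 52 or "-" in code:
        raise ValueError("expected a 52-character ISCC Code without hyphens")
    streams = parse_qs(parts.query).get("stream")
    return IsccUri(code, streams[0] if streams else None)


print(parse_iscc_uri(
    "iscc:11TcMGvUSzqoM1CqVA3ykFawyh1R1sH4Bz8A1of1d2Ju4VjWt26S?stream=smart-license"
))
```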

Procedures & Algorithms#

Base58-ISCC#

The ISCC uses a custom per-component data encoding similar to the zbase62 encoding by Zooko Wilcox-O'Hearn but with a 58-character symbol table. The encoding does not require padding and always yields component codes of 13 characters in length for 72-bit component digests. The predictable size of the encoding is a property that allows for easy composition and decomposition of components without having to rely on a delimiter (hyphen) in the ISCC code representation. Colliding body segments of the digest are preserved by encoding the header and body separately. The ASCII symbol table also minimizes transcription and OCR errors by omitting the easily confused characters 'O', '0', 'I', 'l' and is shuffled to generate human readable component headers.

Symbol table

SYMBOLS = "C23456789rB1ZEFGTtYiAaVvMmHUPWXKDNbcdefghLjkSnopRqsJuQwxyz"

encode#

Signature: encode(digest: bytes) -> str

The encode function accepts a 9-byte ISCC Component Digest and returns the Base58-ISCC encoded alphanumeric string of 13 characters, which we call the ISCC-Component Code.

decode#

Signature: decode(code: str) -> bytes

The decode function accepts a 13-character ISCC-Component Code and returns the corresponding 9-byte ISCC-Component Digest.
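A minimal sketch of encode and decode follows. It assumes that the 1-byte header and the 8-byte body are each treated as a big-endian integer and written as fixed-width base-58 numbers of 2 and 11 symbols. That assumption reproduces the two-character codes in the List of Component Headers, but it has not been checked against the conformance test data.

```python
SYMBOLS = "C23456789rB1ZEFGTtYiAaVvMmHUPWXKDNbcdefghLjkSnopRqsJuQwxyz"
PART_LENGTHS = {1: 2, 8: 11}  # bytes per part -> characters per part


def _encode_part(part: bytes) -> str:
    value = int.from_bytes(part, "big")
    chars = []
    for _ in range(PART_LENGTHS[len(part)]):
        value, index = divmod(value, 58)
        chars.append(SYMBOLS[index])
    return "".join(reversed(chars))


def _decode_part(code: str, n_bytes: int) -> bytes:
    value = 0
    for char in code:
        value = value * 58 + SYMBOLS.index(char)  # ValueError on unknown symbols
    return value.to_bytes(n_bytes, "big")


def encode(digest: bytes) -> str:
    """Encode a 9-byte component digest, header and body separately."""
    if len(digest) != 9:
        raise ValueError("component digest must be 9 bytes long")
    return _encode_part(digest[:1]) + _encode_part(digest[1:])


def decode(code: str) -> bytes:
    """Decode a 13-character component code to its 9-byte digest."""
    if len(code) != 13:
        raise ValueError("component code must be 13 characters long")
    return _decode_part(code[:2], 1) + _decode_part(code[2:], 8)


print(encode(bytes(9)))
# CCCCCCCCCCCCC
print(decode("CCDFPFc87MhdT").hex())
```

Encoding the header on its own is what keeps the first two characters stable per component type, for example CT for 0x10 and CR for 0x30.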

Content Normalization#

The ISCC standardizes some content normalization procedures to support reproducible and stable identifiers. The following normalization functions MUST be provided by a conforming implementation.

text_trim#

Signature: text_trim(text: str) -> str

Trim text such that its UTF-8 encoded byte representation does not exceed 128 bytes. Remove leading and trailing whitespace.
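One way to trim on a byte limit without leaving a broken multi-byte character at the end is to slice the encoded bytes and decode leniently. This is a sketch; the `max_bytes` parameter is added here for illustration and is not part of the signature above.

```python
def text_trim(text: str, max_bytes: int = 128) -> str:
    """Trim text to at most max_bytes of UTF-8 and strip outer whitespace."""
    trimmed = text.encode("utf-8")[:max_bytes]
    # errors="ignore" drops a multi-byte character cut in half by the slice.
    return trimmed.decode("utf-8", errors="ignore").strip()


print(len(text_trim("a" * 200)))
# 128
print(len(text_trim("€" * 50)))  # 3 bytes per character
# 42
```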

text_normalize#

Signature: text_normalize(text: str, keep_ws: bool = False) -> str

We define a text normalization function that is specific to our application. It takes text and an optional boolean keep_ws parameter as an input and returns normalized Unicode text for further algorithmic processing. The text_normalize function performs the following operations in the given order while each step works with the results of the previous operation:

1. Decode to native Unicode if the text is a byte string.
2. Remove leading and trailing whitespace.
3. Transform text to lowercase.
4. Decompose the lowercase text by applying Unicode Normalization Form D (NFD).
5. Filter out all characters that fall into the Unicode categories listed in the constant UNICODE_FILTER. Keep these control characters (Cc) that are commonly considered white-space:
   - \u0009, # Horizontal Tab (TAB)
   - \u000A, # Linefeed (LF)
   - \u000D, # Carriage Return (CR)
6. Keep or remove whitespace depending on the keep_ws parameter.
7. Recombine the text by applying Unicode Normalization Form KC (NFKC).
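The operations above map closely onto Python's `unicodedata` module. The sketch below rests on two loudly stated assumptions. First, this section does not list the contents of UNICODE_FILTER, so the major classes used here (C, M and P) are placeholders. Second, "keeping" whitespace is interpreted as collapsing each whitespace run into a single space.

```python
import unicodedata
from typing import Union

# Assumption: placeholder for the UNICODE_FILTER constant, which is not
# listed in this section (control/format, marks, punctuation classes).
UNICODE_FILTER = ("C", "M", "P")
WHITESPACE_CONTROLS = "\u0009\u000A\u000D"  # TAB, LF, CR are always kept


def text_normalize(text: Union[str, bytes], keep_ws: bool = False) -> str:
    if isinstance(text, bytes):
        text = text.decode("utf-8")  # decoding errors propagate to the caller
    text = text.strip().lower()
    text = unicodedata.normalize("NFD", text)
    text = "".join(
        char for char in text
        if char in WHITESPACE_CONTROLS
        or not unicodedata.category(char).startswith(UNICODE_FILTER)
    )
    # Assumption: keep_ws collapses whitespace runs into single spaces.
    separator = " " if keep_ws else ""
    text = separator.join(text.split())
    return unicodedata.normalize("NFKC", text)


print(text_normalize("  Héllo,\tWörld! ", keep_ws=True))
# hello world
print(text_normalize("  Héllo,\tWörld! "))
# helloworld
```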

image_normalize#

Signature: image_normalize(img) -> List[List[int]]

Accepts a file path, byte-stream or raw binary image data and MUST at least support JPEG, PNG, and GIF image formats. Normalize the image with the following steps:

1. Convert the image to grayscale.
2. Resize the image to 32x32 pixels using bicubic interpolation.
3. Create a 32x32 two-dimensional array of 8-bit gray-scale values from the image data.

Feature Hashing#

The ISCC standardizes various feature hashing algorithms that reduce content features to a binary vector used as the body of the various Content-Code components.

similarity_hash#

Signature: similarity_hash(hash_digests: Sequence[ByteString]) -> bytes

The similarity_hash function takes a sequence of hash digests that represent a set of features. Each of the digests MUST be of equal size. The function returns a new hash digest (raw 8-bit bytes) of the same size. For each bit position in the input hashes, count the hashes that have the bit set and subtract the count of hashes that do not. In the output hash, set that bit position to 0 if the result is negative, or to 1 if it is zero or positive. The resulting hash digest retains similarity for similar sets of input hashes. See also [Charikar2002].

[Figure: ISCC similarity hash]
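The bit-counting rule described above translates directly into code. The following is a straightforward, unoptimized sketch. Numbering bit positions from the least significant bit is an implementation detail that does not affect the result.

```python
from typing import Sequence


def similarity_hash(hash_digests: Sequence[bytes]) -> bytes:
    """Combine equally sized digests into one similarity-preserving digest."""
    n_bytes = len(hash_digests[0])
    if any(len(digest) != n_bytes for digest in hash_digests):
        raise ValueError("all digests must have the same size")
    n_bits = n_bytes * 8
    balance = [0] * n_bits  # (hashes with bit set) - (hashes with bit unset)
    for digest in hash_digests:
        value = int.from_bytes(digest, "big")
        for position in range(n_bits):
            balance[position] += 1 if (value >> position) & 1 else -1
    result = 0
    for position, score in enumerate(balance):
        if score >= 0:  # zero or positive -> 1, negative -> 0
            result |= 1 << position
    return result.to_bytes(n_bytes, "big")


print(similarity_hash([b"\x0c", b"\x0a", b"\x09"]).hex())
# 08
```

In the example, only the fourth-lowest bit is set in a majority of the inputs (1100, 1010, 1001), so the result is 00001000.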

minimum_hash#

Signature: minimum_hash(features: Iterable[int], n: int = 64) -> List[int]

The minimum_hash function takes an arbitrary-sized set of 32-bit integer features and reduces it to a fixed size vector of n features such that it preserves similarity with other sets. It is based on the MinHash implementation of the datasketch library by Eric Zhu.

image_hash#

Signature: image_hash(pixels: List[List[int]]) -> bytes

1. Perform a discrete cosine transform per row of input pixels.
2. Perform a discrete cosine transform per column on the resulting matrix from step 1.
3. Extract the upper left 8x8 corner of the array from step 2 as a flat list.
4. Calculate the median of the results from step 3.
5. Create a 64-bit digest by iterating over the values of step 3 and setting a 1 for values above the median and a 0 for values below or equal to the median.
6. Return the result from step 5.
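The procedure above can be sketched in pure Python. Two details are assumptions because this section does not fix them: the transform is taken to be an unnormalized DCT-II, and the first value of the flat list is taken to supply the most significant bit of the digest. A different DCT variant or bit order would yield different digests.

```python
import math
from statistics import median
from typing import List, Sequence


def dct(values: Sequence[float]) -> List[float]:
    """Unnormalized one-dimensional DCT-II (naive O(n^2) form)."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * (i + 0.5) * k / n) for i, v in enumerate(values))
        for k in range(n)
    ]


def image_hash(pixels: Sequence[Sequence[int]]) -> bytes:
    """Compute a 64-bit DCT-based hash from 32x32 gray-scale pixel data."""
    rows = [dct(row) for row in pixels]                # transform per row
    columns = [dct(column) for column in zip(*rows)]   # transform per column
    matrix = [list(row) for row in zip(*columns)]      # back to row-major order
    flat = [value for row in matrix[:8] for value in row[:8]]  # upper left 8x8
    med = median(flat)
    bits = 0
    for value in flat:
        # Assumption: the first value becomes the most significant bit.
        bits = (bits << 1) | (1 if value > med else 0)
    return bits.to_bytes(8, "big")


# Hypothetical 32x32 test pattern standing in for image_normalize output.
pixels = [[(x * 8 + y * 3) % 256 for x in range(32)] for y in range(32)]
digest = image_hash(pixels)
print(len(digest))
# 8
```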

Content Defined Chunking#

For shift resistant data chunking, the ISCC requires a custom chunking algorithm:

data_chunks#

Signature: data_chunks(data: stream) -> Iterator[bytes]

The data_chunks function accepts a byte-stream and returns variable sized chunks. Chunk boundaries are determined by a gear based chunking algorithm based on [WenXia2016].

Conformance Testing#

An application that claims ISCC conformance MUST pass the tests for all required functions in the ISCC conformance test suite. The test suite is available as JSON data in our GitHub repository. Test data is structured as follows:

{
    "<function_name>": {
        "required": true,
        "<test_name>": {
            "inputs": ["<value1>", "<value2>"],
            "outputs": ["<value1>", "<value2>"]
        }
    }
}

The test suite also contains data for functions that are considered implementation details and MAY be skipped by other implementations. Optional tests are marked as "required": false.

Outputs that are expected to be raw bytes are embedded as HEX encoded strings in JSON and prefixed with hex: to support automated decoding during implementation testing.

Example

Byte outputs in JSON test data:

{
  "data_chunks": {
    "test_001_cat_jpg": {
      "inputs": ["cat.jpg"],
      "outputs": ["hex:ffd8ffe1001845786966000049492a0008", ...]
    }
  }
}
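A test runner needs to walk this structure and turn hex-prefixed strings back into bytes. The sketch below shows one way to do that. The helper names are hypothetical, the sample reuses only the first output value of the example above, and its "required": false entry is an assumption made for the illustration.

```python
import json
from typing import Any, Iterator, Tuple


def decode_value(value: Any) -> Any:
    """Convert "hex:"-prefixed strings to bytes; pass other values through."""
    if isinstance(value, str) and value.startswith("hex:"):
        return bytes.fromhex(value[len("hex:"):])
    return value


def iter_tests(
    test_data: dict, required_only: bool = True
) -> Iterator[Tuple[str, str, list, list]]:
    """Yield (function_name, test_name, inputs, outputs) for each test case."""
    for function_name, group in test_data.items():
        if required_only and not group.get("required", False):
            continue
        for test_name, case in group.items():
            if test_name == "required":
                continue
            inputs = [decode_value(v) for v in case["inputs"]]
            outputs = [decode_value(v) for v in case["outputs"]]
            yield function_name, test_name, inputs, outputs


# Illustrative sample; "required": false is an assumption for this sketch.
sample = json.loads("""
{
  "data_chunks": {
    "required": false,
    "test_001_cat_jpg": {
      "inputs": ["cat.jpg"],
      "outputs": ["hex:ffd8ffe1001845786966000049492a0008"]
    }
  }
}
""")

for function_name, test_name, inputs, outputs in iter_tests(sample, required_only=False):
    print(function_name, test_name, inputs, outputs[0][:2])
```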

License#

This work is licensed under a Creative Commons license (CC BY-NC-SA 4.0).
