ISCC - Specification v1.x#
Last revised: 2022-10-10 16:43:08
Warning
This document is an out of date early draft and retained for historic reasons only. Please follow current development at https://ieps.iscc.codes
Abstract#
The International Standard Content Code (ISCC) is an open and decentralized digital media identifier. An ISCC can be created from digital content and its basic metadata by anybody who follows the procedures of the ISCC specification or by using open source software that supports ISCC creation conforming to the ISCC specification.
Note to Readers#
For public discussion of issues for this specification, please use the GitHub issue tracker: https://github.com/iscc/iscc-specs/issues.
If you want to chat with developers, join us on Telegram at https://t.me/iscc_dev.
You can find the latest version of this specification at http://iscc.codes/specification/.
Public review, discussion and contributions are welcome.
About this Document#
Document Version
While there is already a Version 1.0 spec, we are still expecting backward incompatible changes until Version 2.0. Parts of this specification may become stable earlier. We will document this during minor releases. We encourage partners to follow development and test, implement, and give feedback based on the latest (this) version of the ISCC Specification.
This document proposes an open and vendor neutral ISCC standard and describes the technical procedures to create and manage ISCC identifiers. The first version of this document resulted from a prototyping project by the Content Blockchain Project and received funding from the Google Digital News Initiative (DNI). The content of this document results from a voluntary effort of the authors with an open and public consensus process.
Conventions and Terminology#
The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC 2119].
Definitions#
- Basic Metadata:
- Minimal set of ISCC-specific top-level metadata that applications SHOULD support.
- Bound Metadata:
- Metadata that is utilized during generation of the ISCC. A change of this bound metadata may therefore impact the derived ISCC.
- Extended Metadata:
- Industry and application-specific metadata attached to an ISCC.
- Character:
- Throughout this specification a character is meant to be interpreted as one Unicode code point. This also means that due to the structure of Unicode a character is not necessarily a full glyph but might be a combining accent or similar.
- Digital Media Object:
- A blob of raw bytes with some media type specific encoding.
- Extended Metadata:
- Metadata that is not encoded within the ISCC Meta-Code but may be supplied together with the ISCC.
- Generic Media Type:
- A basic content type such as plain text in a normalized and generic (UTF-8) encoding format.
- ISCC:
- International Standard Content Code
- ISCC Code:
- The printable text encoded representation of an ISCC
- ISCC Digest:
- The raw binary data of an ISCC
Introduction#
An ISCC permanently identifies content at multiple levels of granularity. It is algorithmically generated from basic metadata and the contents of a digital media object. It is designed for being registered and stored on a public and decentralized blockchain. An ISCC for a media object can be created and registered by the content author, a publisher, a service provider or anybody else. By itself the ISCC and its basic registration on a blockchain does not make any statement or claim about authorship or ownership of the identified content.
ISCC Structure#
A Fully Qualified ISCC Digest is a fixed size sequence of 36 bytes (288 bits) assembled from multiple sub-components. The Fully Qualified ISCC Code is a 52 character encoded printable string representation of a complete ISCC Digest. This is a high-level overview of the ISCC creation process:
ISCC Components#
The ISCC Digest is built from multiple self-describing 72-bit components:
Components: | Meta-Code | Content-Code | Data-Code | Instance-Code |
---|---|---|---|---|
Context: | Intangible creation | Content similarity | Data similarity | Data checksum |
Input: | Metadata | Extracted content | Raw data | Raw data |
Algorithms: | Similarity Hash | Type specific | CDC, Minimum Hash | Hash Tree |
Size: | 72 bits | 72 bits | 72 bits | 72 bits |
ISCC components MAY be used separately or in combination by applications for various purposes. Individual components MUST be presented as 13-character base58-iscc encoded strings to end users and MAY be prefixed with their component name.
Single component ISCC-Code (13 characters)
Meta-Code: CCDFPFc87MhdT
Combinations of components MUST include the Meta-Code component and MUST be ordered as Meta-Code, Content-Code, Data-Code, and Instance-Code. Individual components MAY be skipped and SHOULD be separated with hyphens. A combination of components SHOULD be prefixed with "ISCC".
Combination of ISCC-Code components
ISCC: CCPktvj3dVoVa-CTPCWTpGPMaLZ-CDL6QsUZdZzog
A Fully Qualified ISCC Code is an ordered sequence of Meta-Code, Content-Code, Data-Code, and Instance-Code codes. It SHOULD be prefixed with ISCC and MAY be separated by hyphens.
Fully Qualified ISCC-Code (52 characters)
ISCC: CCDFPFc87MhdTCTWAGYJ9HZGj1CDhydSjutScgECR4GZ8SW5a7uc
Fully Qualified ISCC-Code with hyphens (55 characters)
ISCC: CCDFPFc87MhdT-CTWAGYJ9HZGj1-CDhydSjutScgE-CR4GZ8SW5a7uc
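Because every encoded component has a fixed length of 13 characters, a combined code can be decomposed without relying on hyphens. The following Python sketch illustrates this; the function names split_iscc and join_iscc are hypothetical helpers and not part of the specification.

```python
from typing import List

COMPONENT_LENGTH = 13  # characters per encoded component


def split_iscc(code: str) -> List[str]:
    """Split an ISCC code (with or without prefix and hyphens) into components."""
    code = code.strip()
    if code.upper().startswith("ISCC:"):
        code = code[5:]
    clean = code.replace("-", "").strip()
    if not clean or len(clean) % COMPONENT_LENGTH:
        raise ValueError("expected a multiple of 13 code characters")
    return [
        clean[i:i + COMPONENT_LENGTH]
        for i in range(0, len(clean), COMPONENT_LENGTH)
    ]


def join_iscc(components: List[str], hyphens: bool = True) -> str:
    """Join component codes into the prefixed presentation form."""
    separator = "-" if hyphens else ""
    return "ISCC: " + separator.join(components)
```

The sketch only checks lengths. It does not verify the component order or that a Meta-Code is present, which a real application would also need to do.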
Component Types#
Each component has the same basic structure of a 1-byte header and an 8-byte body section.
The 1-byte header of each component is subdivided into 2 nibbles (4 bits each). The first nibble specifies the component type while the second nibble is component specific.
The header only needs to be carried in the encoded representation. As similarity searches across different components are of little use, the type-information contained in the header of each component can be safely ignored after an ISCC has been decomposed and internally typed by an application.
List of Component Headers#
Component | Nibble-1 | Nibble-2 | Byte | Code |
---|---|---|---|---|
Meta-Code | 0000 | 0000 - Reserved | 0x00 | CC |
Content-Code-Text | 0001 | 0000 - Content Type Text | 0x10 | CT |
Content-Code-Text PCF | 0001 | 0001 - Content Type Text + PCF | 0x11 | Ct |
Content-Code-Image | 0001 | 0010 - Content Type Image | 0x12 | CY |
Content-Code-Image PCF | 0001 | 0011 - Content Type Image + PCF | 0x13 | Ci |
Content-Code-Audio | 0001 | 0100 - Content Type Audio | 0x14 | CA |
Content-Code-Audio PCF | 0001 | 0101 - Content Type Audio + PCF | 0x15 | Ca |
Content-Code-Video | 0001 | 0110 - Content Type Video | 0x16 | CV |
Content-Code-Video PCF | 0001 | 0111 - Content Type Video + PCF | 0x17 | Cv |
Content-Code-Mixed | 0001 | 1000 - Content Type Mixed | 0x18 | CM |
Content-Code Mixed PCF | 0001 | 1001 - Content Type Mixed + PCF | 0x19 | Cm |
Data-Code | 0010 | 0000 - Reserved | 0x20 | CD |
Instance-Code | 0011 | 0000 - Reserved | 0x30 | CR |
The body section of each component is specific to the component and always 8-bytes and can thus be fit into a 64-bit integer for efficient data processing. The following sections give an overview of how the different components work and how they are generated.
Meta-Code Component#
The Meta-Code component starts with a 1-byte header 00000000. The first nibble 0000 indicates that this is a Meta-Code component type. The second nibble is reserved for future extended features of the Meta-Code.
The Meta-Code body is built from a 64-bit similarity_hash over 4-character n-grams of the basic metadata of the content to be identified. The basic metadata supplied to the Meta-Code generating function is assumed to be UTF-8 encoded. Errors that occur while decoding such a byte string input to native Unicode MUST terminate the process and must not be silenced. An ISCC generating application MUST provide a meta_id function that accepts minimal and generic metadata and returns a Base58-ISCC encoded Meta-Code component and trimmed metadata.
Inputs to Meta-Code function#
Name | Type | Required | Description |
---|---|---|---|
title | text | Yes | The title of an intangible creation. |
extra | text | No | An optional short statement that distinguishes this intangible creation from another one for forced Meta-Code uniqueness. (default: empty string) |
Note
The basic metadata inputs are intentionally simple and generic. We abstain from more specific metadata for Meta-Code generation in favor of compatibility across industries. To support global clustering, it is RECOMMENDED to only supply the title field for Meta-Code generation. Imagine a creator input-field for metadata. Who would you list as the creators of a movie? The directors, writers, the main actors? Would you list all of them, or only some, and how would you decide whom to list? Global disambiguation of similar title data can be accomplished with the extra-field. Industry- and application-specific metadata requirements can be met by extended metadata.
Generate Meta-Code#
An ISCC generating application must follow these steps in the given order to produce a stable Meta-Code:
- Apply text_normalize separately to the title and extra inputs while keeping white space.
- Apply text_trim to the results of the previous step. The results of this step MUST be supplied as basic metadata for ISCC registration.
- Concatenate the trimmed title and extra using a space (\u0020) as a separator. Remove leading/trailing whitespace.
- Create a list of 4-character n-grams by sliding character-wise through the result of the previous step.
- Encode each n-gram from the previous step to a UTF-8 bytestring and calculate its xxHash64 digest.
- Apply similarity_hash to the list of digests from the previous step.
- Prepend the 1-byte component header (0x00) to the results of the previous step.
- Encode the resulting 9-byte sequence with encode.
- Return the encoded Meta-Code, trimmed title and trimmed extra data.
See also: Meta-Code reference code
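The concatenation and n-gram steps of the Meta-Code procedure can be sketched in Python as follows. This is a minimal illustration, not the reference code: the function name meta_ngrams is hypothetical, the inputs are assumed to be already normalized and trimmed, and the handling of inputs shorter than four characters (a single n-gram containing the whole text) is an assumption. Hashing with xxHash64 and similarity_hash is left out because it needs functions defined elsewhere.

```python
from typing import List


def meta_ngrams(title: str, extra: str = "", width: int = 4) -> List[bytes]:
    """Return UTF-8 encoded character n-grams of the joined title and extra."""
    # Join with a single space and drop leading/trailing whitespace.
    text = " ".join((title, extra)).strip()
    # Assumption: text shorter than `width` yields one n-gram with the whole text.
    count = max(len(text) - width + 1, 1)
    # Slide character-wise (by code point), then encode each n-gram to UTF-8.
    return [text[i:i + width].encode("utf-8") for i in range(count)]
```

Note that the window moves over characters, not bytes, so n-grams of non-ASCII text have varying byte lengths.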
Text trimming
When trimming text be sure to trim the byte-length of the UTF-8 encoded version and not the number of characters. The trim point MUST be such that it does not cut into multi-byte characters. Characters might have different UTF-8 byte-lengths. For example ü is 2 bytes, 驩 is 3 bytes and 𠜎 is 4 bytes. So the trimmed version of a string with 128 驩-characters will result in a 42-character string with a 126-byte UTF-8 encoded length. This is necessary because the results of this operation will be stored as basic metadata with strict byte size limits on the blockchain.
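A byte-aware trim can be sketched in a few lines of Python. This is one possible way to satisfy the rule above, assuming that silently dropping an incomplete trailing byte sequence is acceptable; the max_bytes parameter is added here for illustration only.

```python
def text_trim(text: str, max_bytes: int = 128) -> str:
    """Trim text so that its UTF-8 encoding does not exceed max_bytes."""
    data = text.encode("utf-8")[:max_bytes]
    # errors="ignore" drops a multi-byte character that was cut by the slice,
    # so the result never ends in the middle of a character.
    return data.decode("utf-8", errors="ignore").strip()
```

For a string of 128 three-byte characters the slice keeps 128 bytes, of which the last two belong to an incomplete character and are discarded, leaving 42 characters and 126 bytes.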
Automated Data-Ingestion
Applications that perform automated data-ingestion SHOULD apply a customized preliminary normalization to title data tailored to the dataset. Depending on the catalog data, removing pairs of brackets [], (), {} together with the text between them, or cutting all text after the first occurrence of a semicolon (;) or colon (:), can vastly improve deduplication.
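As an illustration of such a dataset-specific cleanup, the following Python sketch removes bracketed text and cuts the title at the first colon or semicolon. The function name clean_title is hypothetical, and the rules are only an example; nested brackets are not handled, and a real catalog may need different rules.

```python
import re

# Matches [...], (...) and {...} without nesting.
BRACKETED = re.compile(r"\[[^\]]*\]|\([^)]*\)|\{[^}]*\}")


def clean_title(title: str) -> str:
    """Apply an example pre-normalization to catalog title data."""
    # Remove bracket pairs together with the text between them.
    title = BRACKETED.sub(" ", title)
    # Keep only the text before the first colon or semicolon.
    title = re.split(r"[;:]", title, maxsplit=1)[0]
    # Collapse runs of whitespace left behind by the removals.
    return " ".join(title.split())
```

This cleanup would run before the title is passed to the Meta-Code function, so that catalog variants of the same title are more likely to produce the same Meta-Code.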
Dealing with Meta-Code collisions#
Ideally we want multiple ISCCs that identify different manifestations of the same intangible creation to be automatically grouped by an identical leading Meta-Code component. We call such a natural grouping an intended component collision. Metadata, captured and edited by humans, is notoriously unreliable. By using normalization and a similarity hash on the metadata, we account for some of this variation while keeping the Meta-Code component somewhat stable.
Auto-generated Meta-Code components are expected to miss some intended collisions. An application SHOULD check for such missed intended component collisions before registering a new Meta-Code with the canonical registry of ISCCs by conducting a similarity search and asking for user feedback.
But what about unintended component collisions? Such collisions might happen because two different intangible creations have very similar or even identical metadata. But they might also happen by chance. With 2^64 possible Meta-Code bodies (the body is a 64-bit hash) the probability of random collisions rises in an S-curved shape with the number of deployed ISCCs (see: Hash Collision Probabilities). We should keep in mind that the Meta-Code component is only one part of a fully qualified ISCC Code. Unintended collisions of the Meta-Code component are generally deemed as acceptable and expected.
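The S-shaped growth of the chance-collision probability can be estimated with the standard birthday-problem approximation. The sketch below assumes uniformly distributed bodies, which is an idealization because similarity hashes are deliberately not uniform. The function name and the default of 64 bits (the size of the component body described earlier) are choices made for this illustration.

```python
import math


def collision_probability(n_codes: int, bits: int = 64) -> float:
    """Approximate the probability of at least one random collision.

    Uses p = 1 - exp(-n * (n - 1) / (2 * N)) with N = 2 ** bits.
    """
    space = 2 ** bits
    exponent = n_codes * (n_codes - 1) / (2 * space)
    # expm1 keeps precision when the probability is very small.
    return -math.expm1(-exponent)
```

Under these assumptions the probability stays negligible for millions of codes and approaches roughly 39% at about 2^32 codes for a 64-bit body.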
If for any reason an application wants to avoid unintended collisions with pre-existing Meta-Code components, it may use the extra-field. An application MUST first generate a Meta-Code without asking the user for input to the extra-field and then check for collisions with the canonical registry of ISCCs. After it finds a collision with a pre-existing Meta-Code it may display the metadata of the colliding entry and interact with the user to determine if it is an unintended collision. Only if the user indicates an unintended collision may the application ask for a disambiguation that is then added as an amendment to the metadata via the extra-field to create a different Meta-Code component. The application may repeat the pre-existence check until it finds no collision or a user intended collision. The application MUST NOT supply auto-generated input to the extra-field.
It is our opinion that the concept of intended collisions of Meta-Code components is a useful concept and a net positive. But one must know that this characteristic also has its pitfalls. It is not an attempt to provide an unambiguous - agreed upon - definition of "identical intangible creations".
Content-Code Component#
The Content-Code component has multiple subtypes. The subtypes correspond with the Generic Media Types (GMT). A fully qualified ISCC can only have one Content-Code component of one specific GMT, but there may be multiple ISCCs with different Content-Code types per digital media object.
A Content-Code is generated in two broad steps. In the first step, we extract and convert content from a rich media type to a normalized GMT. In the second step, we use a GMT-specific process to generate the Content-Code component of an ISCC.
Generic Media Types#
The Content-Code type is signaled by the first 3 bits of the second nibble of the first byte of the Content-Code:
Content-Code Type | Nibble 2 Bits 0-2 | Description |
---|---|---|
text | 000 | Generated from extracted and normalized plain-text |
image | 001 | Generated from normalized grayscale pixel data |
audio | 010 | To be defined in a later version of the specification |
video | 011 | To be defined in a later version of the specification |
mixed | 100 | Generated from multiple Content-Codes |
(reserved) | 101, 110, 111 | Reserved for future versions of the specification |
Content-Code-Text#
The Content-Code-Text is built from the extracted plain-text content of an encoded media object. To build a stable Content-Code-Text the plain-text content must first be extracted from the digital media object. It should be extracted in a way that is reproducible. There are many text document formats out in the wild and extracting plain-text from all of them is anything but a trivial task. While text-extraction is out of scope for this specification, it is RECOMMENDED that plain-text content be extracted with the open-source Apache Tika v1.23 toolkit if generic reproducibility of the Content-Code-Text component is desired.
An ISCC generating application MUST provide a content_id(text, partial=False) function that accepts UTF-8 encoded plain text and a boolean indicating the partial content flag as input and returns a Content-Code with GMT type text. The procedure to create a Content-Code-Text is:
- Apply text_normalize to the text input while removing white-space.
- Create character-wise n-grams of length 13 from the normalized text.
- Create a list of 32-bit unsigned integer features by applying xxHash32 to the results of the previous step.
- Apply minimum_hash to the list of features from the previous step with n=64.
- Collect the least significant bits from the 64 MinHash features from the previous step.
- Create a 64-bit digest from the collected bits.
- Prepend the 1-byte component header (0x10 full content or 0x11 partial content).
- Encode and return the resulting 9-byte sequence with encode.
See also: Content-Code-Text reference code
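The step that turns 64 MinHash features into a 64-bit digest can be sketched as follows. The function name lsb_digest is hypothetical, and the bit order (the first feature supplies the most significant bit) is an assumption of this sketch; consult the reference code for the authoritative ordering.

```python
from typing import Sequence


def lsb_digest(features: Sequence[int]) -> bytes:
    """Pack the least significant bit of each of 64 features into 8 bytes."""
    if len(features) != 64:
        raise ValueError("expected exactly 64 MinHash features")
    value = 0
    for feature in features:
        # Shift the collected bits left and append this feature's lowest bit.
        value = (value << 1) | (feature & 1)
    return value.to_bytes(8, "big")
```

Prepending the header byte, for example b"\x10" + lsb_digest(features), then gives the 9-byte component digest that is passed to encode.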
Content-Code-Image#
For the Content-Code-Image we are opting for a DCT-based perceptual image-hash instead of a more sophisticated key-point detection based method. In view of the generic deployability of the ISCC we chose an algorithm that has moderate computation requirements and is easy to implement while still being robust against common image manipulations.
An ISCC generating application MUST provide a content_id_image(image, partial=False) function that accepts a local file path to an image and returns a Content-Code with GMT type image. The procedure to create a Content-Code-Image is as follows:
- Apply image_normalize to receive a two-dimensional array of gray-scale pixel data.
- Apply image_hash to the results of the previous step.
- Prepend the 1-byte component header (0x12 full content or 0x13 partial content) to the results of the previous step.
- Encode and return the resulting 9-byte sequence with encode.
See also: Content-Code-Image reference code
Image Data Input
The content_id_image function may optionally accept the raw byte data of an encoded image or an internal native image object as input for convenience.
JPEG Decoding
Decoding of JPEG images is non-deterministic. Different image processing libraries may yield diverging pixel data and result in different Content-Code-Image components. The reference implementation uses the built-in decoder of the Python Pillow imaging library. Future versions of the ISCC specification may define a custom deterministic JPEG decoding procedure.
Content-Code-Mixed#
The Content-Code-Mixed aggregates multiple Content-Codes of the same or different types. It may be used for digital media objects that embed multiple types of media or for collections of contents of the same type. First, we have to collect contents from the mixed media object or content collection and generate Content-Codes for each item. An ISCC conforming application must provide a content_id_mixed function that takes a list of Content-Codes as input and returns a Content-Code-Mixed. Follow these steps to create a Content-Code-Mixed:
Signature: content_id_mixed(cids: List[str], partial: bool=False) -> str
- Decode the list of Content-Codes.
- Extract the first 8 bytes from each digest (Note: this includes the header part of the Content-Codes).
- Apply similarity_hash to the list of digests from the previous step.
- Prepend the 1-byte component header (0x18 full content or 0x19 partial content).
- Apply encode to the result of the previous step and return the result.
See also: Content-Code-Mixed reference code
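The aggregation can be sketched on already decoded 9-byte digests. Everything named here is hypothetical: mixed_digest stands in for the middle steps of content_id_mixed, and majority_hash is a plain bit-wise majority vote that is assumed to behave like the similarity_hash function defined elsewhere in the specification (including the choice that ties produce a 1-bit). Treat the output as illustrative rather than as a conforming Content-Code-Mixed.

```python
from typing import List


def majority_hash(digests: List[bytes]) -> bytes:
    """Bit-wise majority vote over equally sized digests (ties yield 1)."""
    size = len(digests[0])
    counts = [0] * (size * 8)
    for digest in digests:
        value = int.from_bytes(digest, "big")
        for bit in range(size * 8):
            counts[bit] += (value >> bit) & 1
    result = 0
    for bit, count in enumerate(counts):
        if 2 * count >= len(digests):
            result |= 1 << bit
    return result.to_bytes(size, "big")


def mixed_digest(content_digests: List[bytes], partial: bool = False) -> bytes:
    """Build a 9-byte Content-Code-Mixed digest from decoded Content-Codes."""
    if not content_digests:
        raise ValueError("at least one Content-Code digest is required")
    for digest in content_digests:
        # The first nibble 0001 marks a Content-Code component.
        if len(digest) != 9 or digest[0] >> 4 != 0b0001:
            raise ValueError("expected 9-byte Content-Code digests")
    # The first 8 bytes include the header byte of each Content-Code.
    body = majority_hash([digest[:8] for digest in content_digests])
    header = b"\x19" if partial else b"\x18"
    return header + body
```

Because the header byte of each input is part of the hashed data, the media types of the inputs influence the resulting body as well.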
Partial Content Flag (PCF)#
The last bit of the header byte of the Content-Code is the "Partial Content Flag". It designates if the Content-Code applies to the full content, or just some part of it. The PCF MUST be set as a 0-bit (full GMT-specific content) by default. Setting the PCF to 1 enables applications to create multiple linked ISCCs of partial extracts of a content collection. The exact semantics of partial content are outside of the scope of this specification. Applications that plan to support partial Content-Codes MUST define their semantics.
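The header layout described so far (component type in the first nibble, Generic Media Type in the first three bits of the second nibble, PCF in the last bit) can be read with a few bit operations. The following Python sketch is illustrative; the names Header and parse_header are hypothetical.

```python
from typing import NamedTuple, Optional

COMPONENT_TYPES = {
    0b0000: "Meta-Code",
    0b0001: "Content-Code",
    0b0010: "Data-Code",
    0b0011: "Instance-Code",
}
CONTENT_TYPES = {0b000: "text", 0b001: "image", 0b010: "audio",
                 0b011: "video", 0b100: "mixed"}


class Header(NamedTuple):
    component: str
    content_type: Optional[str]  # only set for Content-Codes
    partial: bool                # Partial Content Flag


def parse_header(byte: int) -> Header:
    """Interpret the 1-byte header of a component digest."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("header must be a single byte")
    nibble1, nibble2 = byte >> 4, byte & 0x0F
    component = COMPONENT_TYPES.get(nibble1)
    if component is None:
        raise ValueError("unknown component type")
    if component != "Content-Code":
        # The second nibble is reserved for the other component types.
        return Header(component, None, False)
    content_type = CONTENT_TYPES.get(nibble2 >> 1)
    if content_type is None:
        raise ValueError("reserved content type")
    return Header(component, content_type, bool(nibble2 & 1))
```

For example, the header byte 0x13 is read as a Content-Code of type image with the PCF set.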
PCF Linking Example
Let's assume we have a single newspaper issue "The Times - 03 Jan 2009". You would generate one Meta-Code component with the title "The Times" and extra "03 Jan 2009". The resulting Meta-Code component will be the grouping prefix in this scenario.
We use a Content-Code-Mixed with PCF 0 (not partial) for the ISCC of the newspaper issue. We generate Data-Code and Instance-Code from the print PDF of the newspaper issue.
To create an ISCC for a single extracted image that should convey context with the newspaper issue, we reuse the Meta-Code of the newspaper issue and create a Content-Code-Image with PCF 1 (partial to the newspaper issue). For the Data-Code or Instance-Code of the image, we are free to choose if we reuse those of the newspaper issue or create separate ones. The former would express strong specialization of the image to the newspaper issue (not likely to be useful out of context). The latter would create a stronger link to an eventual standalone ISCC of the image. Note that the ISCC of the individual image keeps links in both ways:
- Image is linked to the newspaper issue by identical Meta-Code component
- Image is linked to the standalone version of the image by identical Content-Code-Image body
This is just one example that illustrates the flexibility that the PCF-Flag provides in concert with a grouping Meta-Code. With great flexibility comes great danger of complexity. Applications SHOULD do careful planning before using the PCF-Flag with internally defined semantics.
Data-Code Component#
For the Data-Code that encodes data similarity we use a content defined chunking algorithm that provides some shift resistance and calculate the MinHash from those chunks. To accommodate small files, the first 100 chunks have a ~140-byte size target while the remaining chunks target ~6 kB in size.
The Data-Code is built from the raw encoded data of the content to be identified. An ISCC generating application MUST provide a data_id function that accepts the raw encoded data as input.
Generate Data-Code#
- Apply data_chunks to the raw encoded content data.
- For each chunk, calculate the xxHash32 integer hash.
- Apply minimum_hash to the resulting list of 32-bit unsigned integers with n=64.
- Collect the least significant bits from the 64 MinHash features.
- Create a 64-bit digest from the collected bits.
- Prepend the 1-byte component header (0x20).
- Apply encode to the result of the previous step and return the result.
See also: Data-Code reference code
Instance-Code Component#
The Instance-Code is built from the raw data of the media object to be identified and serves as checksum for the media object. The raw data of the media object is split into 64-kB data-chunks. Then we build a hash-tree from those chunks and use the truncated tophash (Merkle root) as component-body of the Instance-Code.
To guard against length-extension attacks and second preimage attacks, we use double sha256 for hashing. We also prefix the hash input data with a 0x00-byte for the leaf node hashes and with a 0x01-byte for the internal node hashes. While the Instance-Code itself is a non-cryptographic checksum, the full tophash may be supplied in the extended metadata of an ISCC if secure integrity verification is required.
An ISCC generating application MUST provide an instance_id function that accepts the raw data file as input and returns an encoded Instance-Code and a full hex-encoded 256-bit tophash.
Generate Instance-Code#
- Split the raw bytes of the encoded media object into 64-kB chunks.
- For each chunk, calculate the sha256d of the concatenation of a 0x00-byte and the chunk bytes. We call the resulting values leaf node hashes (LNH).
- Calculate the next level of the hash tree by applying sha256d to the concatenation of a 0x01-byte and adjacent pairs of LNH values. If the length of the list of LNH values is uneven, concatenate the last LNH value with itself. We call the resulting values internal node hashes (INH).
- Recursively apply 0x01-prefixed pair-wise hashing to the results of the last step until the process yields only one hash value. We call this value the tophash.
- Trim the resulting tophash to the first 8 bytes.
- Prepend the 1-byte component header (0x30).
- Encode the resulting 9-byte sequence with encode to an Instance-Code.
- Hex-encode the tophash.
- Return the Instance-Code and the hex-encoded tophash.
See also: Instance-Code reference code
Applications may carry, store, and process the leaf node hashes for advanced streaming data identification or partial data integrity verification.
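The hash tree construction can be sketched with Python's hashlib. Two points are assumptions of this sketch and not statements of the specification: "64-kB" is interpreted as 64,000 bytes (exposed as the chunk_size parameter), and empty input is treated as a single empty chunk. The function names top_hash and instance_digest are hypothetical; the final encode step is omitted.

```python
import hashlib
from typing import Tuple


def sha256d(data: bytes) -> bytes:
    """Double SHA-256."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def top_hash(data: bytes, chunk_size: int = 64_000) -> bytes:
    """Compute the Merkle root over fixed-size chunks of data."""
    # Assumption: empty input is hashed as one empty chunk.
    chunks = [data[i:i + chunk_size]
              for i in range(0, len(data), chunk_size)] or [b""]
    # Leaf node hashes are prefixed with 0x00.
    level = [sha256d(b"\x00" + chunk) for chunk in chunks]
    while len(level) > 1:
        if len(level) % 2:
            # Uneven level: pair the last hash with itself.
            level.append(level[-1])
        # Internal node hashes are prefixed with 0x01.
        level = [sha256d(b"\x01" + level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]


def instance_digest(data: bytes) -> Tuple[bytes, str]:
    """Return the 9-byte Instance-Code digest and the hex-encoded tophash."""
    root = top_hash(data)
    return b"\x30" + root[:8], root.hex()
```

A streaming implementation would read the file chunk by chunk and keep only the leaf node hashes in memory, which also makes them available for partial integrity checks.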
ISCC Metadata#
As a generic content identifier, the ISCC makes minimal assumptions about metadata that must or should be supplied together with an ISCC. The RECOMMENDED data-interchange format for ISCC metadata is JSON. We distinguish between Basic Metadata and Extended Metadata:
Basic Metadata#
Basic metadata for an ISCC is metadata that is explicitly defined by this specification. The following table enumerates basic metadata fields for the top-level of the JSON metadata object:
Name | Type | Required | Bound | Description |
---|---|---|---|---|
version | integer | No | No | Version of ISCC Specification. Assumed to be 1 if omitted. |
title | text | Yes | Yes | The title of an intangible creation identified by the ISCC. The normalized and trimmed UTF-8 encoded text MUST NOT exceed 128 bytes. The result of processing title and extra data with the meta_id function MUST match the Meta-Code component of the ISCC. |
extra | text | No | Yes | An optional short statement that distinguishes this intangible creation from another one for Meta-Code uniqueness. |
tophash | text (hex) | No | No | The full hex-encoded tophash (Merkle root) returned by the instance_id function. |
meta | array | No | No | A list of one or more extended metadata entries. Must include at least one entry if specified. |
Attention
Bound metadata impacts the ISCC Code (Meta-Code) and cannot be changed afterwards. Depending on adoption and real world use, future versions of this specification may define new basic metadata fields. Applications MAY add custom fields at the top level of the JSON object, but MUST prefix those fields with an underscore to avoid collisions with future extensions of this specification.
Extended Metadata#
Extended metadata for an ISCC is metadata that is not explicitly defined by this specification. All such metadata SHOULD be supplied as JSON objects within the top-level meta-array field. This allows for a flexible and extendable way to supply additional industry specific metadata about the identified content.
Extended metadata entries MUST be wrapped in a JSON object of the following structure:
Name | Description |
---|---|
schema | The schema-field may indicate a well-known metadata schema (such as Dublin Core, IPTC, ID3v2, ONIX) that is used. RECOMMENDED schema: "schema.org" |
mediatype | The mediatype-field specifies an IANA Media Type. RECOMMENDED mediatype: "application/ld+json" |
url | A URL that is expected to host the metadata with the indicated schema and mediatype. This field is only required if the data-field is omitted. |
data | The data-field holds the metadata conforming to the indicated schema and mediatype. It is only required if the url-field is omitted. |
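To make the structure concrete, the sketch below builds an example metadata object and checks a few of the rules stated above (required title, the 128-byte limit, underscore-prefixed custom fields, non-empty meta list, url or data per entry). The function name check_metadata, the field values, and the custom field _internal_id are invented for illustration; the check is not exhaustive.

```python
import json
from typing import Any, Dict, List

BASIC_FIELDS = {"version", "title", "extra", "tophash", "meta"}


def check_metadata(metadata: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in an ISCC metadata object."""
    problems = []
    title = metadata.get("title")
    if not isinstance(title, str) or not title:
        problems.append("title is required")
    elif len(title.encode("utf-8")) > 128:
        problems.append("title exceeds 128 bytes in UTF-8")
    for key in metadata:
        if key not in BASIC_FIELDS and not key.startswith("_"):
            problems.append(f"custom field {key!r} must start with an underscore")
    if "meta" in metadata:
        entries = metadata["meta"]
        if not isinstance(entries, list) or not entries:
            problems.append("meta must contain at least one entry")
        else:
            for index, entry in enumerate(entries):
                if "url" not in entry and "data" not in entry:
                    problems.append(f"meta[{index}] needs a url or data field")
    return problems


# Illustrative values only.
example = {
    "version": 1,
    "title": "the times",
    "extra": "03 jan 2009",
    "_internal_id": "issue-0001",
    "meta": [{
        "schema": "schema.org",
        "mediatype": "application/ld+json",
        "data": {"@type": "PublicationIssue", "name": "The Times"},
    }],
}
print(json.dumps(example, indent=2))
```

The check does not verify that title and extra reproduce the Meta-Code of the ISCC, which requires the meta_id function.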
ISCC Registration#
The ISCC is a decentralized identifier. ISCCs can be generated for content by anybody who has access to the content. Because of the clustering properties of its components, the ISCC provides utility in data interchange and de-duplication scenarios even without a global registry. There is no central authority for the registration of ISCC codes or certification of content authorship.
As an open system, the ISCC allows any person or organization to offer ISCC registration services as they see fit and without the need to ask anyone for permission. This also presumes that no person or organization may claim exclusive authority about ISCC registration.
Blockchain Registry#
To properly address the questions of identifier uniqueness, ownership and authentication within the ISCC Standard, the assignment of a set of canonical blockchains is a requirement. The distributed nature of blockchains makes them a good fit for long-term persistent identifier registration and resolver services.
The assignment of a set of canonical blockchains is NOT YET part of this specification.
Because this decision is of such vital importance, we suggest waiting for further feedback and additional community involvement before we address these questions either in an updated version of this specification or in a separate specification.
Our recommendation to the community is to agree on a set of decentralized, open, and public blockchains that have specific support for registry-services. This would maximize the value for all participants in the ecosystem. Governance and protocol related questions are being worked on by many projects.
See also: ISCC-Stream specification.
ISCC Embedding#
Embedding ISCC codes into content is only RECOMMENDED if it does not create a side effect. We call it a side effect if embedding an ISCC code changes the content to such an extent that it yields a different ISCC code.
Side effects will depend on the combination of ISCC components that are to be embedded. A Meta-Code can always be embedded without side effects because it does not depend on the content itself. Content-Code and Data-Code may not change if embedded in larger media objects. Instance-Codes cannot easily be embedded as they will inevitably have a side effect on the post-embedding Instance-Code without special processing.
Applications MAY embed ISCC codes that have side effects if they specify a procedure by which the embedded ISCC codes can be stripped in such a way that the stripped content will yield the original embedded ISCC codes.
ISCC Embedding
We can embed the following combination of components from the markdown version of this document into the document itself because adding or removing them has no side effect:
ISCC: CCDbMYw6NfC8a-CTtW9UFozcmBJ-CDYJsRdBNAERM
ISCC URI Scheme#
Provisional Section
The ISCC URI Scheme and link-resolver details ultimately depend on identifier registration, ownership, uniqueness and governance related decisions which are not yet part of this specification. See also: Blockchain Registry.
The purpose of the ISCC URI scheme based on RFC 3986 is to enable users to discover information like metadata or license offerings from an ISCC marked content by clicking a link on a webpage or by scanning a QR-Code.
The scheme name is iscc. The path component MUST be a fully qualified ISCC Code without hyphens. An optional stream query key MAY indicate the blockchain stream information source. If the stream query key is omitted, applications SHOULD return information from the open ISCC Stream.
The scheme name component ("iscc:") is case-insensitive. Applications MUST accept any combination of uppercase and lowercase letters in the scheme name. All other URI components are case-sensitive.
Applications MAY register themselves as a handler for the "iscc:" URI scheme if no other handler is already registered. If another handler is already registered, an application MAY ask the user to change it on the first run of the application.
URI Syntax#
<foo> means placeholder, [bar] means optional.
iscc:<fq-iscc-code>[?stream=<name>]
URI Example#
iscc:11TcMGvUSzqoM1CqVA3ykFawyh1R1sH4Bz8A1of1d2Ju4VjWt26S?stream=smart-license
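A URI of this form can be taken apart with Python's standard urllib.parse module. The sketch below is illustrative: the names IsccUri and parse_iscc_uri are hypothetical, and validation is limited to the scheme, the 52-character length, and the Base58-ISCC symbol table.

```python
from typing import NamedTuple, Optional
from urllib.parse import parse_qs, urlsplit

SYMBOLS = "C23456789rB1ZEFGTtYiAaVvMmHUPWXKDNbcdefghLjkSnopRqsJuQwxyz"


class IsccUri(NamedTuple):
    code: str               # fully qualified ISCC Code without hyphens
    stream: Optional[str]   # value of the optional stream query key


def parse_iscc_uri(uri: str) -> IsccUri:
    """Parse an iscc: URI into its code and optional stream name."""
    parts = urlsplit(uri)
    # urlsplit lowercases the scheme, so "ISCC:" and "iscc:" both match.
    if parts.scheme != "iscc":
        raise ValueError("not an iscc: URI")
    code = parts.path
    if len(code) != 52 or any(ch not in SYMBOLS for ch in code):
        raise ValueError("path is not a fully qualified ISCC Code")
    streams = parse_qs(parts.query).get("stream")
    return IsccUri(code, streams[0] if streams else None)
```

When stream is None, an application following the text above would fall back to the open ISCC Stream.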
Procedures & Algorithms#
Base58-ISCC#
The ISCC uses a custom per-component data encoding similar to the zbase62 encoding by Zooko Wilcox-O'Hearn but with a 58-character symbol table. The encoding does not require padding and will always yield component codes of 13 characters length for 72-bit component digests. The predictable size of the encoding is a property that allows for easy composition and decomposition of components without having to rely on a delimiter (hyphen) in the ISCC code representation. Colliding body segments of the digest are preserved by encoding the header and body separately. The ASCII symbol table also minimizes transcription and OCR errors by omitting the easily confused characters 'O', '0', 'I', 'l' and is shuffled to generate human readable component headers.
Symbol table
SYMBOLS = "C23456789rB1ZEFGTtYiAaVvMmHUPWXKDNbcdefghLjkSnopRqsJuQwxyz"
encode#
Signature: encode(digest: bytes) -> str
The encode function accepts a 9-byte ISCC Component Digest and returns the Base58-ISCC encoded alphanumeric string of 13 characters, which we call the ISCC-Component Code.
See also: Base-ISCC Encoding reference code
decode#
Signature: decode(code: str) -> bytes
The decode function accepts a 13-character ISCC-Component Code and returns the corresponding 9-byte ISCC-Component Digest.
See also: Base-ISCC Decoding reference code
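The following Python sketch shows one way the encode and decode functions could work, based only on the description above: the 1-byte header and the 8-byte body are encoded separately as big-endian base-58 numbers using the symbol table. The fixed widths of 2 and 11 symbols are an assumption derived from the 13-character total (58^2 covers 256 header values, 58^11 covers 2^64 body values). The helper names _encode_part and _decode_part are hypothetical. The reference code remains authoritative.

```python
SYMBOLS = "C23456789rB1ZEFGTtYiAaVvMmHUPWXKDNbcdefghLjkSnopRqsJuQwxyz"


def _encode_part(data, width):
    """Encode bytes as a fixed-width base-58 string (hypothetical helper)."""
    value = int.from_bytes(data, "big")
    chars = []
    for _ in range(width):
        value, remainder = divmod(value, 58)
        chars.append(SYMBOLS[remainder])
    return "".join(reversed(chars))


def _decode_part(code, n_bytes):
    """Decode a base-58 string back to n_bytes bytes (hypothetical helper)."""
    value = 0
    for char in code:
        value = value * 58 + SYMBOLS.index(char)
    # Raises OverflowError if the code does not fit into n_bytes.
    return value.to_bytes(n_bytes, "big")


def encode(digest):
    if len(digest) != 9:
        raise ValueError("component digest must be 9 bytes long")
    # Header (1 byte) and body (8 bytes) are encoded separately.
    return _encode_part(digest[:1], 2) + _encode_part(digest[1:], 11)


def decode(code):
    if len(code) != 13:
        raise ValueError("component code must be 13 characters long")
    return _decode_part(code[:2], 1) + _decode_part(code[2:], 8)
```

Because the header is encoded on its own, every component with the same header byte shares the same two leading characters, which is what makes the component type readable at a glance.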
Content Normalization#
The ISCC standardizes some content normalization procedures to support reproducible and stable identifiers. The following normalization functions MUST be provided by a conforming implementation.
text_trim#
Signature: text_trim(text: str) -> str
Trim text such that its UTF-8 encoded byte representation does not exceed 128 bytes. Remove leading and trailing whitespace.
See also: Text trimming reference code
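A minimal Python sketch of text_trim under the description above. It assumes that a multi-byte character cut in half by the 128-byte limit is dropped entirely, and that whitespace is stripped after truncation. Both are assumptions about details this section does not spell out.

```python
def text_trim(text):
    # Truncate the UTF-8 bytes, then drop any incomplete trailing
    # character by ignoring decode errors.
    trimmed = text.encode("utf-8")[:128].decode("utf-8", "ignore")
    return trimmed.strip()
```

Truncating at the byte level rather than the character level keeps the size limit independent of the script used in the text.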
text_normalize#
Signature: text_normalize(text: str, keep_ws: bool = False) -> str
We define a text normalization function that is specific to our application. It takes text and an optional boolean keep_ws parameter as an input and returns normalized Unicode text for further algorithmic processing. The text_normalize function performs the following operations in the given order while each step works with the results of the previous operation:
- Decode to native Unicode if the text is a byte string
- Remove leading and trailing whitespace
- Transform text to lowercase
- Decompose the lower case text by applying Unicode Normalization Form D (NFD).
- Filter out all characters that fall into the Unicode categories listed in the constant UNICODE_FILTER. Keep these control characters (Cc) that are commonly considered white-space:
  - \u0009 # Horizontal Tab (TAB)
  - \u000A # Linefeed (LF)
  - \u000D # Carriage Return (CR)
- Keep or remove whitespace depending on the keep_ws parameter
- Re-Combine the text by applying Unicode Normalization Form KC (NFKC).
See also: Text normalization reference code
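The steps above can be sketched in Python as follows. The UNICODE_FILTER value used here is only an illustrative placeholder, not the constant defined by this specification. The sketch also assumes UTF-8 for byte input and reads "remove whitespace" as removing all whitespace characters when keep_ws is False.

```python
import unicodedata

# Placeholder only: the specification defines the actual category list.
UNICODE_FILTER = frozenset({"Cc", "Cf", "Mn", "Pd", "Po"})
# Control characters that are kept even though they are in category Cc.
CC_KEEP = frozenset({"\u0009", "\u000A", "\u000D"})


def text_normalize(text, keep_ws=False):
    if isinstance(text, bytes):
        text = text.decode("utf-8")  # assumption: input bytes are UTF-8
    text = text.strip().lower()
    decomposed = unicodedata.normalize("NFD", text)
    filtered = "".join(
        char
        for char in decomposed
        if char in CC_KEEP or unicodedata.category(char) not in UNICODE_FILTER
    )
    if not keep_ws:
        # str.split() without arguments splits on any whitespace run.
        filtered = "".join(filtered.split())
    return unicodedata.normalize("NFKC", filtered)
```

Decomposing with NFD before filtering is what allows combining marks such as accents to be removed separately from their base letters.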
image_normalize#
Signature: image_normalize(img) -> List[List[int]]
Accepts a file path, byte-stream or raw binary image data and MUST at least support JPEG, PNG, and GIF image formats. Normalize the image with the following steps:
- Convert the image to grayscale
- Resize the image to 32x32 pixels using bicubic interpolation
- Create a 32x32 two-dimensional array of 8-bit gray-scale values from the image data
See also: Image normalization reference code
Feature Hashing#
The ISCC standardizes various feature hashing algorithms that reduce content features to a binary vector used as the body of the various Content-Code components.
similarity_hash#
Signature: similarity_hash(hash_digests: Sequence[ByteString]) -> bytes
The similarity_hash function takes a sequence of hash digests that represent a set of features. Each of the digests MUST be of equal size. The function returns a new hash digest (raw 8-bit bytes) of the same size. For each bit in the input-hashes calculate the number of hashes with that bit set and subtract the count of hashes where it is not set. For the output-hash set the same bit position to 0 if the count is negative or 1 if it is zero or positive. The resulting hash digest will retain similarity for similar sets of input hashes. See also [Charikar2002].
See also: Similarity hash reference code
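A straightforward Python sketch of the bit-counting rule described above. It favors readability over speed and is not taken from the reference code.

```python
def similarity_hash(hash_digests):
    n_bytes = len(hash_digests[0])
    n_bits = n_bytes * 8
    counts = [0] * n_bits
    for digest in hash_digests:
        if len(digest) != n_bytes:
            raise ValueError("all digests must be of equal size")
        value = int.from_bytes(digest, "big")
        for i in range(n_bits):
            # Add 1 if the bit is set, subtract 1 if it is not.
            if (value >> (n_bits - 1 - i)) & 1:
                counts[i] += 1
            else:
                counts[i] -= 1
    result = 0
    for i, count in enumerate(counts):
        if count >= 0:  # zero or positive sets the output bit to 1
            result |= 1 << (n_bits - 1 - i)
    return result.to_bytes(n_bytes, "big")
```

For the inputs b"\xff", b"\x0f" and b"\x00", the four high bits are set in one of three digests (count -1) and the four low bits in two of three (count +1), so the result is b"\x0f".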
minimum_hash#
Signature: minimum_hash(features: Iterable[int], n: int = 64) -> List[int]
The minimum_hash function takes an arbitrary-sized set of 32-bit integer features and reduces it to a fixed size vector of n features such that it preserves similarity with other sets. It is based on the MinHash implementation of the datasketch library by Eric Zhu.
See also: Minimum hash reference code
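To illustrate the general MinHash idea, the Python sketch below applies n hash functions of the form (a * x + b) mod p to every feature and keeps the minimum per function. The permutation parameters are generated from a seeded random generator purely for illustration. They are an assumption and will not reproduce the values of a conforming implementation, which must use the constants from the reference code.

```python
import random

MAX_INT64 = (1 << 64) - 1
MERSENNE_PRIME = (1 << 61) - 1
MAX_HASH = (1 << 32) - 1

# Illustrative parameters only, not the constants of the specification.
_rng = random.Random(1)
PERMUTATIONS = [
    (_rng.randrange(1, MERSENNE_PRIME), _rng.randrange(0, MERSENNE_PRIME))
    for _ in range(64)
]


def minimum_hash(features, n=64):
    if n > len(PERMUTATIONS):
        raise ValueError("not enough permutation parameters for n")
    hashvalues = [MAX_HASH] * n
    for feature in features:
        for i in range(n):
            a, b = PERMUTATIONS[i]
            # Emulate 64-bit wraparound, then reduce to 32 bits.
            h = (((a * feature + b) & MAX_INT64) % MERSENNE_PRIME) & MAX_HASH
            if h < hashvalues[i]:
                hashvalues[i] = h
    return hashvalues
```

Two sets that share many features tend to agree at many positions of the resulting vectors, which is the property that preserves similarity.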
image_hash#
Signature: image_hash(pixels: List[List[int]]) -> bytes
1. Perform a discrete cosine transform per row of input pixels.
2. Perform a discrete cosine transform per column on the resulting matrix from step 1.
3. Extract the upper left 8x8 corner of the array from step 2 as a flat list.
4. Calculate the median of the results from step 3.
5. Create a 64-bit digest by iterating over the values of step 3 and setting a 1 for values above the median and a 0 for values below or equal to the median.
6. Return the results from step 5.
See also: Image hash reference code
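The procedure above, read as a sequential pipeline (row transform, column transform, 8x8 corner, median, bit digest), can be sketched in pure Python. The unnormalized DCT-II used here and the big-endian bit order of the digest are assumptions. The reference code determines the exact transform variant.

```python
import math
from statistics import median


def dct(values):
    """Naive unnormalized DCT-II (assumed variant)."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
            for i, v in enumerate(values))
        for k in range(n)
    ]


def image_hash(pixels):
    # DCT per row, then per column of the row-transformed matrix.
    rows = [dct(row) for row in pixels]
    cols = [dct(list(col)) for col in zip(*rows)]
    matrix = [list(row) for row in zip(*cols)]
    # Upper left 8x8 corner as a flat list of 64 values.
    flat = [value for row in matrix[:8] for value in row[:8]]
    med = median(flat)
    bits = 0
    for value in flat:
        # 1 for values above the median, 0 otherwise.
        bits = (bits << 1) | (1 if value > med else 0)
    return bits.to_bytes(8, "big")
```

The upper left corner of the transformed matrix holds the low-frequency coefficients, so the digest reflects coarse image structure rather than fine detail.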
Content Defined Chunking#
For shift resistant data chunking, the ISCC requires a custom chunking algorithm:
data_chunks#
Signature: data_chunks(data: stream) -> Iterator[bytes]
The data_chunks function accepts a byte-stream and returns variable sized chunks. Chunk boundaries are determined by a gear based chunking algorithm based on [WenXia2016].
See also: CDC reference code
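The Python sketch below illustrates the general principle of gear based chunking: a rolling pattern is updated per byte from a lookup table, and a chunk ends where the masked pattern is zero. The gear table, the size limits, the mask and the use of an in-memory bytes object instead of a stream are all assumptions chosen for illustration. They do not match the parameters of the reference code, so the chunks will differ from those of a conforming implementation.

```python
import random

# Illustrative gear table, not the table of the specification.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def data_chunks(data, min_size=64, max_size=1024, mask=0xFF):
    start = 0
    total = len(data)
    while start < total:
        end = min(start + max_size, total)
        cut = end  # fall back to the maximum size or end of data
        pattern = 0
        for pos in range(start, end):
            # Gear rolling hash: shift and add the table value of the byte.
            pattern = ((pattern << 1) + GEAR[data[pos]]) & 0xFFFFFFFF
            if pos - start + 1 >= min_size and (pattern & mask) == 0:
                cut = pos + 1
                break
        yield data[start:cut]
        start = cut
```

Because boundaries depend on local content rather than absolute offsets, inserting bytes near the start of the data typically changes only the chunks around the insertion point.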
Conformance Testing#
An application that claims ISCC conformance MUST pass all required functions from the ISCC conformance test suite. The test suite is available as JSON data in our Github Repository. Test data is structured as follows:
{
  "<function_name>": {
    "required": true,
    "<test_name>": {
      "inputs": ["<value1>", "<value2>"],
      "outputs": ["<value1>", "<value2>"]
    }
  }
}
The test suite also contains data for functions that are considered implementation details and MAY be skipped by other implementations. Optional tests are marked as "required": false.
Outputs that are expected to be raw bytes are embedded as HEX encoded strings in JSON and prefixed with hex: to support automated decoding during implementation testing.
Example
Byte outputs in JSON test data:
{
  "data_chunks": {
    "test_001_cat_jpg": {
      "inputs": ["cat.jpg"],
      "outputs": ["hex:ffd8ffe1001845786966000049492a0008", ...]
    }
  }
}
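A test harness might prepare the test data along the following lines. This Python sketch walks an already parsed test suite, skips optional functions and decodes hex: prefixed values into bytes. The names decode_value and iter_required_cases are hypothetical, and treating a missing "required" key as required is an assumption.

```python
def decode_value(value):
    """Turn 'hex:...' strings into bytes; leave other values untouched."""
    if isinstance(value, str) and value.startswith("hex:"):
        return bytes.fromhex(value[4:])
    return value


def iter_required_cases(suite):
    """Yield (function_name, test_name, inputs, outputs) for required tests."""
    for function_name, tests in suite.items():
        # Assumption: a missing "required" key means the test is required.
        if not tests.get("required", True):
            continue
        for test_name, case in tests.items():
            if test_name == "required":
                continue
            inputs = [decode_value(v) for v in case["inputs"]]
            outputs = [decode_value(v) for v in case["outputs"]]
            yield function_name, test_name, inputs, outputs
```

The suite argument is expected to be the dictionary obtained by parsing the JSON test data, for example with json.load.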
License#
Copyright © 2016 - 2020 The Authors, Content Blockchain Project
This work is licensed under a Creative Commons (CC BY-NC-SA 4.0).