ISCC - Capabilities

The ISCC is a content-derived, multi-component, similarity-preserving identifier and fingerprint for digital content, standardized as ISO 24138:2024. Because it is computed from the content itself, independent parties can derive the same code from the same asset. The capabilities below follow from that design.

Multi-layer content identification

[Figure: ISCC algorithmic design]

An ISCC-CODE is a composite of several ISCC-UNITs, each produced by a distinct algorithm and each capturing a different layer of identity or similarity. A composite ISCC-CODE contains at minimum a Data-Code and an Instance-Code. The units are self-describing and can also be used in isolation.

  • Meta-Code: similarity of the descriptive metadata, such as the title and description.
  • Semantic-Code: conceptual similarity of the meaning of the content, independent of its exact wording or encoding (reserved in ISO 24138:2024; no standardized algorithm yet - see below).
  • Content-Code: perceptual and structural similarity of the content, with dedicated algorithms for the Text, Image, Audio, and Video modalities.
  • Data-Code: similarity of the raw data, the encoded bitstream of the file.
  • Instance-Code: exact data identity, expressed as a cryptographic checksum of the bytes.

Semantic-Code

ISO 24138:2024 reserves a MainType for Semantic-Codes (conceptual similarity) but does not yet define an algorithm for them. Experimental implementations exist for text (iscc-sct) and images (iscc-sci).
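The layer model above can be sketched as a small data structure. This is only an illustration: the `Layer` enum, `REQUIRED`, `missing_layers`, and `can_compose` are hypothetical names, not part of iscc-core. The only rule encoded is the one stated above, that a composite ISCC-CODE contains at least a Data-Code and an Instance-Code.

```python
from enum import Enum


class Layer(Enum):
    """Hypothetical labels for the unit types listed above (not iscc-core names)."""

    META = "Meta-Code"
    SEMANTIC = "Semantic-Code"
    CONTENT = "Content-Code"
    DATA = "Data-Code"
    INSTANCE = "Instance-Code"


# Per the text, a composite ISCC-CODE needs at least these two layers.
REQUIRED = frozenset({Layer.DATA, Layer.INSTANCE})


def missing_layers(layers):
    """Return the required layers absent from `layers`, sorted by name."""
    return sorted(REQUIRED - set(layers), key=lambda layer: layer.name)


def can_compose(layers):
    """True if the given layers are enough for a composite ISCC-CODE."""
    return not missing_layers(layers)


print(can_compose([Layer.META, Layer.DATA, Layer.INSTANCE]))
print([layer.value for layer in missing_layers([Layer.META, Layer.CONTENT])])
```

The first call reports that the set is sufficient; the second lists the Data-Code and Instance-Code as missing.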

Similarity-preserving fingerprints

[Figure: ISCC similarity hash]

Each ISCC-UNIT, except for the Instance-Code, is a similarity-preserving hash: similar inputs produce codes that are close in Hamming distance. Likeness can therefore be estimated by comparing codes directly, without access to the original files.
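As a rough illustration of comparing codes directly, the sketch below counts differing bits between two unit bodies. It rests on assumptions not stated in this document: that a unit string is unpadded RFC 4648 base32 and that a standard 64-bit unit carries a 2-byte header before its body. The helper names `unit_body` and `hamming_distance` are hypothetical, and the sample bodies are made-up bytes rather than real ISCC output.

```python
import base64


def unit_body(code: str, header_len: int = 2) -> bytes:
    """Decode a base32 unit string and drop its header.

    Assumption: unpadded RFC 4648 base32 with a 2-byte header.
    """
    text = code.split(":")[-1].upper()
    padded = text + "=" * (-len(text) % 8)  # restore padding for b32decode
    return base64.b32decode(padded)[header_len:]


def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equal-length bodies differ."""
    if len(a) != len(b):
        raise ValueError("bodies must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# Made-up 64-bit bodies that differ in their two lowest bits.
body_a = bytes.fromhex("ff00ff00ff00ff00")
body_b = bytes.fromhex("ff00ff00ff00ff03")
print(hamming_distance(body_a, body_b))  # 2
```

A small distance suggests similar inputs for that layer; what counts as "small" depends on the unit type and the application.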

Because similarity is preserved independently per layer, comparing the individual units shows how two assets relate:

  • Matching Content-Code, differing Data-Code and Instance-Code → the same work in a different encoding, such as a re-compressed image.
  • Matching Data-Code, differing Instance-Code → near-identical files with a small byte-level change.
  • Matching Instance-Code → bit-for-bit identical files.

This supports near-duplicate detection, deduplication, clustering, and integrity checks from the codes alone.
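The per-layer reasoning in the list above could be sketched as a simple decision function. Everything here is an assumption for illustration: the `Units` container, the `relate` function, and especially the `max_distance` threshold of 8 bits, which stands in for whatever "matching" means in a given application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Units:
    """Decoded unit bodies of one asset (hypothetical container)."""

    content: bytes
    data: bytes
    instance: bytes


def distance(a: bytes, b: bytes) -> int:
    """Hamming distance between two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("bodies must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def relate(a: Units, b: Units, max_distance: int = 8) -> str:
    """Classify two assets, checking the strictest layer first."""
    if a.instance == b.instance:
        return "identical"
    if distance(a.data, b.data) <= max_distance:
        return "near-identical data"
    if distance(a.content, b.content) <= max_distance:
        return "same content, different encoding"
    return "unrelated"


# Made-up bodies: same content layer, completely different data layer.
original = Units(content=b"\x0f" * 8, data=b"\xaa" * 8, instance=b"\x01" * 8)
recompressed = Units(content=b"\x0f" * 7 + b"\x0e", data=b"\x55" * 8, instance=b"\x02" * 8)
print(relate(original, recompressed))  # same content, different encoding
```

Checking the Instance-Code first keeps the exact-match case cheap and unambiguous; the similarity layers are only consulted when the bytes differ.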

Decentralized, algorithmic generation

[Figure: ISCC decentralized issuance]

ISCCs are generated from the content, not assigned by an authority. No registration, account, or central database is involved: anyone with the open-source software and the asset derives the same code. Generation is implemented in iscc-core, the ISO 24138 reference implementation, and the higher-level iscc-sdk.

Data integrity and exact matching

The Instance-Code is a cryptographic hash of the raw bytes. Any single-bit change produces a completely different Instance-Code, which makes it a reliable check for data integrity and bit-exact identity. The Data-Code, by contrast, is a soft similarity hash designed to survive minor modifications. Together they distinguish an identical file from a merely similar one.
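The sensitivity of a cryptographic hash to a single-bit change can be observed with a short experiment. In this sketch, SHA-256 from the Python standard library is a stand-in chosen for availability; it is not the hash the standard specifies for the Instance-Code, and the helper names are hypothetical.

```python
import hashlib


def digest(data: bytes) -> bytes:
    """SHA-256 as a stand-in cryptographic hash (not the ISCC algorithm)."""
    return hashlib.sha256(data).digest()


def flip_bit(data: bytes, index: int) -> bytes:
    """Return a copy of `data` with the bit at position `index` inverted."""
    out = bytearray(data)
    out[index // 8] ^= 1 << (index % 8)
    return bytes(out)


def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bits that differ between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


original = b"The quick brown fox jumps over the lazy dog"
tampered = flip_bit(original, 0)  # exactly one input bit changes

changed = differing_bits(digest(original), digest(tampered))
print(f"{changed} of 256 digest bits changed after a 1-bit edit")
```

Roughly half of the digest bits are expected to change, which is why such a hash says nothing about how similar two files are, only whether they are identical.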

Robust across formats

The Content-Code is derived from the decoded, normalized content rather than the file bytes, so it is robust to format conversion, re-compression, resizing, and minor edits. Different encodings of the same work produce identical or closely matching Content-Codes, which lets an ISCC connect related formats, such as PDF, EPUB, and Word files, or JPEG and PNG images, without relying on filenames or external metadata.
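The difference between hashing file bytes and hashing decoded, normalized content can be illustrated with a toy example. This is not the ISCC Content-Code algorithm: the normalization steps (NFKC, case folding, whitespace collapsing) and the exact SHA-256 key are assumptions chosen for brevity, and an exact hash only matches when the normalized content is identical, whereas a similarity hash also tolerates small differences.

```python
import hashlib
import unicodedata


def byte_key(raw: bytes) -> str:
    """Key derived from the encoded file bytes."""
    return hashlib.sha256(raw).hexdigest()


def content_key(raw: bytes, encoding: str) -> str:
    """Key derived from decoded, normalized text (toy normalization)."""
    text = raw.decode(encoding)
    text = unicodedata.normalize("NFKC", text).casefold()
    text = " ".join(text.split())  # collapse runs of whitespace
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# The same sentence stored in two encodings with cosmetic differences.
utf8_file = "Hello,  World!\n".encode("utf-8")
utf16_file = "hello, world!".encode("utf-16")

print(byte_key(utf8_file) == byte_key(utf16_file))  # False
print(content_key(utf8_file, "utf-8") == content_key(utf16_file, "utf-16"))  # True
```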

Granular content identification

Beyond one code per layer, the ISCC ecosystem supports SIMPRINTS: headerless similarity hashes that describe individual segments of a work, such as a paragraph, an image region, or a scene. SIMPRINTS enable fine-grained matching like partial overlap, quotation, and paraphrase detection.

Granular features are not packed into the ISCC-CODE. The ISCC-CODE stays a single, fixed-size, document-level descriptor; the segment-level SIMPRINTS can optionally be recorded alongside it in the ISCC metadata as a features list, each entry carrying the offset and size that locate it in the original content. The work and its parts are linked by association within one metadata record, not by nesting the parts inside the code.

{
  "iscc": "ISCC:KACZH265WE3KJOSR...",
  "units": ["ISCC:AADZH265WE3KJOSR...", "ISCC:EADUZ5XBKQCWGG4H...", "..."],
  "features": [
    {
      "maintype": "content",
      "subtype": "text",
      "simprints": ["8IAnFvInk24iEkDG...", "GH7W703iOzPEyhD2..."],
      "offsets": [0, 698],
      "sizes": [698, 469]
    }
  ]
}

The document-level iscc and units describe the whole work; the features list locates each segment within it.
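A consumer of such a record might walk the features list and cut the matching segments out of the content. The sketch below assumes that offsets and sizes count characters of the extracted text, which this document does not specify; `locate_segments` is a hypothetical helper, and the record uses placeholder simprint strings and a short made-up text.

```python
import json


def locate_segments(metadata: dict, content: str):
    """Yield each simprint together with the segment it describes.

    Assumption: offsets and sizes are character positions in `content`.
    """
    for feature in metadata.get("features", []):
        rows = zip(feature["simprints"], feature["offsets"], feature["sizes"])
        for simprint, offset, size in rows:
            yield {
                "kind": f'{feature["maintype"]}/{feature["subtype"]}',
                "simprint": simprint,
                "segment": content[offset:offset + size],
            }


# Placeholder record in the shape shown above (values are made up).
record = json.loads("""
{
  "features": [
    {
      "maintype": "content",
      "subtype": "text",
      "simprints": ["simprint-a", "simprint-b"],
      "offsets": [0, 12],
      "sizes": [12, 9]
    }
  ]
}
""")
text = "Alpha block.Beta part"

for item in locate_segments(record, text):
    print(item["kind"], item["simprint"], repr(item["segment"]))
```

Keeping the simprints, offsets, and sizes as parallel lists means the three must stay aligned by index, which `zip` relies on here.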

Status

SIMPRINTS and Semantic-Codes are experimental and not yet part of ISO 24138:2024. The experimental implementations are iscc-sct for text and iscc-sci for images.

Versioning and variant detection

Because the codes preserve similarity, an ISCC can track versions and variants of a work over time. Comparing Content-Codes places two assets on a similarity scale, measured as the share of matching bits across the 64-bit code body, while the Instance-Code separates an exact duplicate from a near-duplicate. This helps spot edits, watermarked copies, and re-releases.
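The similarity scale described above is simple arithmetic: the number of matching bits divided by the body length. A minimal sketch, treating each 64-bit body as an integer; the function name `similarity` and the sample values are hypothetical.

```python
def similarity(a: int, b: int, bits: int = 64) -> float:
    """Share of matching bits between two code bodies of `bits` length."""
    mask = (1 << bits) - 1
    differing = bin((a ^ b) & mask).count("1")
    return (bits - differing) / bits


original = 0x0123456789ABCDEF
edited = original ^ 0b1111  # flip the four lowest bits

print(similarity(original, edited))  # 0.9375
```

With 4 of 64 bits differing, 60 bits match, giving 60 / 64 = 0.9375. A value of 1.0 means the bodies are equal, which for a similarity layer still does not prove the files are bit-identical; that is what the Instance-Code is for.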

Timestamping and provenance

The in-development ISCC-ID binds an ISCC-CODE to a verifiable point in time. It is issued by an ISCC-HUB in response to a signed declaration and returned with a signed receipt. Integrity rests on an append-only transparency log and on standard timestamping based on RFC 3161 or OpenTimestamps. This provenance layer does not require registering the ISCC on any distributed ledger. It also makes the ISCC a useful soft-binding identifier for content provenance and authenticity workflows, such as C2PA.

Status

The ISCC-ID and the surrounding discovery protocol are under active development and are not yet part of ISO 24138:2024.

Cross-sector and complementary

The ISCC works across text, image, audio, and video, which makes it applicable wherever digital content is produced, processed, or distributed: journalism, books, music, film, science, and more. It makes existing identifiers such as ISBN, ISRC, and DOI discoverable by content rather than replacing them.