Existing neural network frameworks are large and hard for newcomers to use. WebNNM is a lightweight, easy-to-understand, high-level framework with an embedded expert system that simplifies working with neural networks for newcomers whilst offering experts fine-grained control. This specification defines a high-level format for neural network models, and the API exposed by the associated JavaScript library. For an introduction and links to web-based demos, see [[WEBNNM-INTRO]].
This document reflects implementation experience, but is still subject to change. Feedback is welcome through GitHub issues or on the public-cogai@w3.org mailing-list (with public archives).
This specification introduces a simple notation and API for inference, testing and training of neural network models in Web browsers, using [[WebNN]] for hardware acceleration. The intent is to make developing and using neural networks easy for newcomers, whilst offering experts fine-grained control.
WebNNM makes it easy to define neural network models in terms of blocks and layers. This includes the means to define layers as a composition of other layers, and to repeat layers using a sequence of tensor shapes. The WebNNM library is a JavaScript module that offers a simple API for declaring models, loading models from previously saved snapshots, and using them for inference, testing and training in conjunction with JavaScript modules for handling datasets. Models can be pre-trained in the cloud via WebNNM's support for export to the [[StableHLO/MLIR]] format. Fine-tuning in the browser then allows for privacy-friendly personalization, avoiding the need to send personal data to the cloud. WebNNM's expert system reviews the neural network, the hyperparameters and the dataset with respect to training stability and precision, sparing novices from difficult choices and warning where potential problems may arise.
The grammatical rules in this document are to be interpreted as described in [[[RFC5234]]] [[RFC5234]].
Conformance to this specification is defined for five conformance classes:
WebNNM models are defined as a set of blocks, where each block has a set of properties, including a sequence of layers.
Each layer has an operator, e.g. matmul, and zero or more operands followed by zero or more options. In this example, shapes, shape and activation are options. The shapes option signifies that a layer is to be repeated using the corresponding tensor shapes: in this case, a dense layer with output shape [128], a second with shape [80] and a third with shape [40]. WebNNM uses macros to bind options and their values. This is used to bind the activation to "relu" or "softmax". w and b are operands used to name model parameters. Note that the tensor shapes are declared without the dimensions for the batch size or sequence length, which are bound later. You are free to declare the data types if needed. The default data type is float32, subject to type inference. WebNNM applies tensor data type and shape inference, removing the need to explicitly specify the shape for every layer.
The permitted data types are taken from [[WebNN]]: float32, float16, int32, uint32, int8, uint8, int4, and uint4 (in order of decreasing precision). Note that [[WebNN]] places further restrictions on which data types can be used for certain operators. In addition, [[WebNN]] uses int32 for the size of each dimension, limiting it to a maximum of 2147483647.
WebNNM applies data type and shape inference working from the inputs and outputs. Automatic data type casts are applied when needed from a lower to a higher precision, e.g. float16 to float32, but not from a higher to a lower precision. Note that type casting is applied lazily from the inputs to the outputs, i.e. preserving the lower precision until a cast to a higher precision is required. If you need greater control, you can provide explicit data types on the layers of interest.
Block input and output properties are expressed as the block name, followed by a colon then "input" or "output" respectively. This is followed by an optional data type and then a required shape, e.g. model:input float16 shape=[784]. Block properties are terminated with a semicolon. Layers are comma separated. The block name property is used when saving a snapshot.
WebNNM parses to an object model for blocks and layers, and from there to a directed acyclic graph (DAG) with node types for inputs, outputs, parameters, literals and WebNN operations. Each node has an explicit data type and tensor shape. Nodes that are not on a path from an input node to an output node are culled, along with associated parameters and literals. Higher level operations are decomposed into WebNN operations as part of the process of building the DAG. Parameter nodes are annotated with attributes relating to how their tensors should be initialised based upon the activation function, for instance the Glorot or He algorithms, as well as good initial values for bias parameters.
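For illustration, the sketch below shows the standard Glorot (uniform) and He (normal) initialisation formulas that such annotations correspond to; the helper names are purely illustrative and are not part of the WebNNM API.

```javascript
// Illustrative sketch of the standard Glorot and He initialisers; the
// helper names are assumptions, not part of the WebNNM API.
function glorotUniform(fanIn, fanOut) {
  // Glorot/Xavier: uniform in [-limit, limit] with limit = sqrt(6 / (fanIn + fanOut))
  const limit = Math.sqrt(6 / (fanIn + fanOut));
  return () => (Math.random() * 2 - 1) * limit;
}

function heNormal(fanIn) {
  // He: normal with standard deviation sqrt(2 / fanIn), suited to relu layers
  const std = Math.sqrt(2 / fanIn);
  return () => {
    // Box-Muller transform for a standard normal sample
    const u = 1 - Math.random(), v = Math.random();
    return std * Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
  };
}

// Example: fill a [784, 128] weight tensor for a relu dense layer
const init = heNormal(784);
const weights = Float32Array.from({ length: 784 * 128 }, init);
```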
The DAG is used to build executable WebNN graphs on demand for inference, testing and training. For testing, the dataset provides a sequence of batches of test data where the last batch may be smaller than the rest. The testing graph uses masking to ignore the missing data in the last batch. Training is similar. The training graph starts with a forward pass to compute the loss, then a backwards pass to compute the gradient of the loss with respect to each model parameter. This is followed by an optimizer that computes updates for the model parameters along with their momentum. A ping/pong approach is used to retain parameter values in the GPU or NPU's memory until training is complete. Further details are given in a later section.
WebNNM aims to make things easy for newcomers whilst giving expert users the control they seek. An expert system auto-configures hyperparameters unless explicitly overridden, along with providing warnings about potential problems.
WebNNM iterates through the DAG to build a profile of the network (e.g., looking for normalization layers, dropout, and counting parameter-heavy operations). The profile is mapped against the provided hyperparameters to either auto-configure settings or emit warnings. Here is a comparison with other neural network frameworks:
| Feature | WebNNM | Keras (TF/JAX) | PyTorch Lightning | TensorFlow.js |
| --- | --- | --- | --- | --- |
| Philosophy | Expert-Guided: Proactively analyzes the DAG to fix common pitfalls. | User-Centric: Provides building blocks; leaves stability to the user. | Research Abstraction: Removes boilerplate but requires manual config. | Deployment First: Focuses on running models; training is secondary. |
| Regularization | Capacity-Aware: Auto-enables L2 if a high-capacity model lacks Dropout. | Manual / Granular: Must be added per-layer (kernel_regularizer). | Optimizer-Linked: Usually manual via weight_decay in the optimizer. | Manual: Required per-layer; no global "auto-regularize" logic. |
| Gradient Stability | Automatic (DAG-based): Detects depth/norms to enable AGC or Global Norm. | Manual: User must manually add clipping to the optimizer. | Configurable: User sets flags; framework doesn't "suggest" values. | Manual: Very low-level; no "auto" scaling/clipping. |
| Precision (float16) | Adaptive: Auto-scales based on model data types and DAG risk. | AMP: Efficient, but doesn't "warn" about specific layer risks. | AMP Toggle: Simple toggle; no topological analysis of risk. | Semi-Manual: Requires manual casts and precision management. |
| Multimodal Loss | Auto-Balanced: Avoids underflow by learning relative weights automatically. | Manual: User must provide a dictionary/list of weights. | Manual: Logic is typically buried in the training_step. | Manual: Verbose implementation for multi-output loss. |
For consumer-grade equipment, NPUs often support $float16$ but not $float32$. To speed training and prolong battery life, it is desirable to enable WebNN to use the NPU where practical through the use of $float16$ where this doesn't detract from training stability. WebNNM supports automatic casting, using $float16$ where it is safe to do so and switching back to $float32$ for numerically critical operations. Note that WebNNM does not support $bfloat16$, although that may be added in future if and when it is added to WebNN.
This section describes the API for inference, testing and training, along with the API used for datasets.
The example starts with importing the WebNNM module and a dataset module. The application logic is called when the web page has finished loading. Note that the load event handler is marked as async, which is essential for subsequent use of await within the handler. The handler creates an instance of the dataset and declares the WebNNM model. These are passed to the NNModel.create() to create an instance of the model and bind it to the dataset. Finally, the application calls model.test() to apply the test subset of data provided by the dataset. Note that the WebNNM library calls the log function to log messages. This function defaults to console.log, but can be overridden by the application to log messages to the web page.
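A minimal sketch of this pattern is shown below. The module paths, the dataset class name and the argument order of NNModel.create() are illustrative assumptions rather than normative API; only the overall flow follows the description above.

```javascript
// Sketch only: the module paths, the MnistData class name and the argument
// order of NNModel.create() are assumptions, not normative API.
import { NNModel } from "./webnnm.js";      // assumed module path
import { MnistData } from "./mnist.js";     // hypothetical dataset module

window.addEventListener("load", async () => {
  const dataset = new MnistData();          // instance of the dataset
  const source = `
    model:input float16 shape=[784];
    model:output shape=[10];
  `;                                        // layer declarations omitted here
  // Create the model instance and bind it to the dataset
  const model = await NNModel.create(source, dataset);
  await model.test();                       // apply the test subset of the data
});
```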
This section describes the API for inference.
The example starts with int32 for the input tensor and progressively casts it to float16 and finally to float32 for the output tensor. Note the use of numeric literals for operands for the add and pow operations.
The application calls model.createContext() to create a context object for inference. This object is used to initialise the input tensor before calling model.run(context) to create and execute the inference graph. context.output() is then used to retrieve the output tensor. Note the use of model.view() as a convenience function for logging tensors.
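The following sketch illustrates this flow, continuing from a model instance created as in the previous example; the block name "model", the use of setData() for initialisation and the await points are assumptions.

```javascript
// Continuing from a model created with NNModel.create(); the block name,
// input values and await points below are assumptions.
const context = model.createContext();            // context object for inference
const inputValues = new Float32Array(784);        // example input (assumed shape)
context.setData("model", inputValues);            // initialise the input tensor
await model.run(context);                         // build and execute the inference graph
const result = await context.output("model");     // retrieve the output tensor
console.log(model.view(result));                  // view() is a convenience for logging tensors
```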
Context object methods:
setBatchSize(batchSize)
async input(name)
async output(name)
async randomize(blockName, lower, upper)
setData(blockName, data)
This section describes the API for testing.
model.test()
This section describes the API for training.
Training is similar to testing, but involves a set of training hyperparameters, so called to avoid confusion with the model's trainable parameters:
model.train(hyperparameters)
hyperparameters is an object with the following optional properties:
The model parameters are validated after each epoch if the dataset provides a validation subset. On detecting a minimum in the validation loss, the model weights are saved, and training continues for a given number of epochs. The training loss is reported every 50 epochs. Applications should always test the model when training completes: testing on independent data is essential to verify that the network has truly learned generalizable patterns rather than simply memorizing the training data.
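The sketch below illustrates a training call using hyperparameters described in the following subsections; the epochs property name and the specific values are illustrative assumptions.

```javascript
// Illustrative training call; the `epochs` property name and the values shown
// are assumptions, the other property names are hyperparameters described
// later in this section.
await model.train({
  epochs: 200,            // assumed name for the maximum number of epochs
  warmupEpochs: 10,       // warm-up phase before cosine annealing
  scaling: "auto",        // gradient scaling: 'auto', 'on' or 'off'
  clipping: "auto",       // gradient clipping: 'auto', 'on' or 'off'
  scale: 1024,            // initial gradient scaling factor
  growthInterval: 200,    // batches without overflow before boosting the scale
  freeze: 0               // proportion of layers to freeze (transfer learning)
});
await model.test();       // always test on independent data after training
```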
WebNNM supports a choice of learning-rate optimizers. The choices all involve tracking the gradient and its momentum for each of the model parameters:
softsign in place of arctan, which isn't supported by WebNN.
Training can be rather time consuming, so it makes sense to avoid wasting time when the initial choice of model parameter values doesn't train well. WebNNM creates a scouting graph to evaluate a set of randomly chosen potential starting points using a few heuristics. The metrics include: Loss, Gradient Ratio, and Dead Ratio. The gradient ratio detects vanishing or exploding gradients, whilst the dead ratio detects dead neurons or overconfident layers.
WebNNM manages training in terms of a warm-up phase in which the learning rate is gradually increased to the maximum learning rate. After that, the learning rate is gradually reduced using cosine annealing, reaching a very low value at the maximum number of epochs. The default value for warmupEpochs is 5% of the maximum number of epochs, or 1, whichever is larger.
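For illustration, the warm-up and cosine annealing schedule described above can be expressed as follows; the function name and the minimum learning rate are assumptions.

```javascript
// Sketch of the schedule described above: linear warm-up to the maximum
// learning rate followed by cosine annealing down to a small floor value.
// The function name and the floor value are assumptions.
function learningRate(epoch, maxEpochs, maxRate, warmupEpochs, minRate = 1e-6) {
  if (epoch < warmupEpochs) {
    // gradually increase the learning rate during warm-up
    return maxRate * (epoch + 1) / warmupEpochs;
  }
  // cosine annealing from maxRate down to minRate over the remaining epochs
  const progress = (epoch - warmupEpochs) / Math.max(1, maxEpochs - warmupEpochs);
  return minRate + 0.5 * (maxRate - minRate) * (1 + Math.cos(Math.PI * progress));
}

// Default warm-up length: 5% of the maximum number of epochs, or 1 epoch
const warmupEpochs = Math.max(1, Math.round(0.05 * 200));
```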
Multimodal models combine different modalities, e.g. video and audio. Multimodal models provide superior accuracy and robustness by leveraging redundant information and cross-sensory context to resolve ambiguities that a single data source cannot. Multi-task models are trained on multiple objectives from a shared representation. WebNNM allows you to specify the relative weights of each modality (or task) to avoid one modality drowning out the learning signals for another, especially in $float16$ environments where underflow is a constant risk. WebNNM addresses this challenge using Uncertainty-Based Weighting. Each task $i$ is assigned a learnable log-variance $s_i$. The total loss $L$ is computed as:$$L = \sum_i \left( \frac{1}{2e^{s_i}} L_i + \frac{1}{2}s_i \right)$$ This allows the optimizer to dynamically attenuate noisy tasks. To ensure stability in $float16$, $s_i$ is initialized at $0.0$ and a gradient scale $S \ge 128$ is recommended.
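The sketch below evaluates this formula directly in JavaScript for illustration; in practice the log-variances would be trainable parameters in the training graph rather than plain numbers.

```javascript
// Sketch of the uncertainty-based weighting formula above: each task i has a
// learnable log-variance s[i], and L = sum_i (L_i / (2 * exp(s_i)) + s_i / 2).
function combinedLoss(taskLosses, logVariances) {
  let total = 0;
  for (let i = 0; i < taskLosses.length; i++) {
    const precision = 1 / (2 * Math.exp(logVariances[i])); // attenuates noisy tasks
    total += precision * taskLosses[i] + 0.5 * logVariances[i];
  }
  return total;
}

// s_i initialised at 0.0 for float16 stability, as recommended above
const s = [0.0, 0.0];
console.log(combinedLoss([0.8, 1.3], s)); // equal weighting at initialisation
```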
WebNNM supports transfer learning by allowing you to freeze parameter updates for the early layers of the network. This is based upon annotating each node in the DAG with the maximum number of nodes on a path to an input node, and likewise to an output node. The freeze hyperparameter is a number in the range 0 to 1, defaulting to 0. It describes the proportion of layers to freeze, starting from the inputs. Parameters associated with nodes closer to the outputs are updated with a fraction that goes from 0 to 1 in a sinusoidal pattern (over the range 0 to $\pi/2$). This range is discretised into 5 buckets to enable kernel optimization.
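A possible reading of this schedule is sketched below; the handling of the default freeze value of 0 and the exact bucket boundaries are assumptions.

```javascript
// Illustrative sketch of the freezing schedule described above: layers at a
// relative depth below `freeze` get an update fraction of 0, and deeper
// layers follow sin((d - freeze)/(1 - freeze) * PI/2), discretised into 5
// buckets. The treatment of freeze = 0 and the bucket boundaries are assumptions.
function updateFraction(depth, freeze) {
  // depth: 0 at the inputs, 1 at the outputs; freeze: proportion to freeze
  if (freeze <= 0) return 1;
  if (depth <= freeze) return 0;
  const fraction = Math.sin(((depth - freeze) / (1 - freeze)) * Math.PI / 2);
  return Math.ceil(fraction * 5) / 5;   // discretise into 5 buckets
}

console.log(updateFraction(0.3, 0.5)); // 0 (frozen, close to the inputs)
console.log(updateFraction(0.9, 0.5)); // 1 (fully updated, near the outputs)
```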
Training involves computing gradients that risk vanishing or exploding. For models with many layers it is good practice to introduce skip connections that bypass a few layers that would otherwise overly weaken the learning signals. It is common to use float32 whilst training to provide adequate precision for gradients. However, the hardware accelerators in consumer-grade equipment may not support float32, motivating the use of a lower precision, e.g. float16. This increases the risk of underflow and overflow during gradient calculations. Another consideration is that whilst GPUs can be used for hardware acceleration of neural networks, they consume considerably more electrical power than NPUs like Apple's ANE. To speed up training in the browser and reduce power consumption, it is desirable to find ways to increase training stability when using float16.
To address this challenge, WebNNM supports techniques to boost weak gradients, to scale down overly large gradients, and to ignore updates that overflow the maximum number supported for a given precision. The scaling hyperparameter controls gradient scaling, whilst the clipping hyperparameter controls gradient clipping. If you set these to 'auto' (the default), the library examines the width and depth of the model along with the activation functions to assess whether to enable scaling and clipping. For explicit control, set these hyperparameters to 'on' or 'off' as desired.
To prevent underflow during the backward pass, WebNNM multiplies the loss $L$ by a scale factor $S$ (where $S \gg 1$) before backpropagation begins. By the chain rule, this scales every subsequent gradient in the computational graph: $$\nabla_{\theta} (L \cdot S) = S \cdot \nabla_{\theta} L$$ This ensures that the intermediate gradients $g$ that would have been $10^{-7}$ (and thus zeroed out in float16) are now $g \cdot S$, keeping them within the precision range. We cannot apply the scaled gradients directly to the model's parameters, as this would be equivalent to using a learning rate that is $S$ times too large. Before the optimizer updates the parameter $\theta$, the gradients need to be unscaled back to their original magnitude: $$\theta_{t+1} = \theta_t - \eta \cdot \frac{\sum \nabla_{\theta} (L \cdot S)}{S}$$
Use the scale hyperparameter to set the initial gradient scaling factor. The scale is doubled until overflow is detected, at which point it is halved. To prevent instability, the growthInterval hyperparameter sets the number of batches without overflow to wait before boosting the scale factor. The forward and backward passes may use float16 for speed, but it is better to keep the model parameters in float32, and likewise to cast from float16 to float32 for operations like softmax and batchnorm. WebNNM will do this for you unless you explicitly set the data types in the model.
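The sketch below illustrates this doubling and halving behaviour; the state object and its method are illustrative and not part of the WebNNM API.

```javascript
// Sketch of the dynamic loss-scaling behaviour described above; the state
// object and method names are assumptions, not part of the WebNNM API.
const scaler = {
  scale: 1024,            // initial gradient scaling factor (the `scale` hyperparameter)
  growthInterval: 200,    // batches without overflow before boosting the scale
  goodBatches: 0,

  update(overflowDetected) {
    if (overflowDetected) {
      // skip this batch's parameter update and halve the scale
      this.scale = Math.max(1, this.scale / 2);
      this.goodBatches = 0;
      return false;       // caller ignores the update for this batch
    }
    if (++this.goodBatches >= this.growthInterval) {
      this.scale *= 2;    // double the scale after a run of stable batches
      this.goodBatches = 0;
    }
    return true;          // caller applies the unscaled gradients
  }
};
```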
Clipping is based upon the scaled global (L2) norm $||\mathbf{G}||_2$ which is defined as the square root of the sum of the squares of every individual scalar value across all gradient tensors $g_i$:$$||\mathbf{G}||_2 = \sqrt{\sum_{i=1}^{n} \sum_{x \in g_i} x^2}$$
The training graph computes the rolling mean for the scaled global norm and its absolute deviation, as a basis for dynamically setting the clipping threshold:
Note that $k$ is 4.0 for the first 15 batches and 2.0 after that.
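A sketch of how such a dynamic threshold might be applied is shown below, assuming the threshold is the rolling mean plus $k$ times the rolling absolute deviation; that combination, and scaling all gradients by the resulting factor, are assumptions.

```javascript
// Sketch of the dynamic clipping described above, assuming the threshold is
// the rolling mean of the scaled global norm plus k times its rolling mean
// absolute deviation (this combination is an assumption). Gradients are
// scaled down whenever the current norm exceeds the threshold.
function clipFactor(globalNorm, rollingMean, rollingDeviation, batchIndex) {
  const k = batchIndex < 15 ? 4.0 : 2.0;            // looser threshold early in training
  const threshold = rollingMean + k * rollingDeviation;
  return Math.min(1, threshold / globalNorm);       // multiply every gradient by this factor
}
```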
Work is underway to support adaptive gradient clipping (AGC) and regularization on a layer-by-layer basis.
This section describes the API for datasets.
This section lists the WebNN operators supported by WebNNM along with additional operators, e.g. for skip connections and transformers, that are internally translated into WebNN operators.
Some common operators are not built into WebNN and are compiled into WebNN sub-graphs. These include:
This section defines the grammar for the model syntax.
WebNNM supports the following loss functions and checks that they are consistent with the activation function, label format and a sample of the training data. You can set the loss function explicitly as the loss property for the associated block (e.g. model:loss CCEL;), otherwise WebNNM will pick one for you based upon the context. Note that some loss functions have parameters, e.g. $\delta$ for Huber Loss and $\gamma$ for Focal Loss. These can be specified as block properties, e.g. model:delta 2.1;. If they are not defined, the WebNNM expert system will set them based on the standard deviation of the error observed during the initial scouting pass.
| Loss Function | Mathematical Form (Loss) | Explanation | Best Suited For | WebNN Implementation Primitives |
| --- | --- | --- | --- | --- |
| Mean Squared Error (MSE) | $\frac{1}{n}\sum(y - \hat{y})^2$ | Calculates the average of the squares of the errors | Standard Regression: Best for tasks where targets are continuous values and you want to penalize large errors more heavily than small ones, e.g. predicting house prices. | `sub` $\rightarrow$ `pow(2)` $\rightarrow$ `reduceMean` |
| Mean Absolute Error (MAE) | $\frac{1}{n}\sum\|y - \hat{y}\|$ | Calculates the average of the absolute differences between targets and predictions | Robust Regression: Ideal when your dataset contains significant outliers, as it is less sensitive to extreme values than MSE. | `sub` $\rightarrow$ `abs` $\rightarrow$ `reduceMean` |
| Binary Cross-Entropy (BCE) | $-\frac{1}{n}\sum [y\ln(p) + (1-y)\ln(1-p)]$ | Measures the performance of a classification model whose output is a probability value between 0 and 1 | Binary Classification: Used for single-label or multi-label "yes/no" tasks. | `log`, `mul`, `neg`, `sub` |
| Categorical Cross-Entropy (CCE) | $-\sum y_i \ln(\hat{y}_i)$ | Measures the difference between two probability distributions | Multi-class Classification: Standard for "one-of-many" classification. Expects Softmax inputs. | `log`, `mul`, `reduceSum`, `neg` |
| CCE with Logits (CCEL) | $-\sum y_i \cdot {LogSoftmax}(\hat{y}_i)$ | Combines Softmax activation and Cross-Entropy into a single stable step | Stability Choice: The expert system forces this when an NPU (float16) is detected, to prevent overflow. | `LogSoftmaxNode` $\rightarrow$ `mul` $\rightarrow$ `neg` |
| Sparse Categorical Cross-Entropy (SCE) | $-\ln(\hat{y}_{target})$ | Identical to CCE but accepts integer labels (e.g. 3) instead of one-hot vectors ([0,0,1,0]) | Memory-Efficient Multi-class: For use when labels are integer indices rather than one-hot vectors. | `gather` (to select logits) $\rightarrow$ `neg` |
| Hinge Loss (HIL) | $\sum \max(0, 1 - y_i \cdot \hat{y}_i)$ | Support Vector Machines: A "maximum-margin" loss function | Used for SVM-style classification; robust to small variations, where you want a safety margin between class boundaries. | `mul`, `sub`, `relu` |
| Huber Loss (HUL) | $\begin{cases} \frac{1}{2}(y-\hat{y})^2 & \text{if } \text{error} \leq \delta \\ \delta(\|y-\hat{y}\| - \frac{1}{2}\delta) & \text{else} \end{cases}$ | General Purpose Regression: Acts as MSE for small errors and MAE for large errors | Balanced Regression: Switch to this if `scout` detects high variance in gradients. | `sub`, `abs`, `where`, `mul` |
| KL Divergence (KLD) | $\sum y_i \ln(\frac{y_i}{\hat{y}_i})$ | Distribution Matching: Measures how a probability distribution approximates a reference distribution | Used in Variational Autoencoders (VAEs) and Knowledge Distillation | `div` (or `sub` of logs) $\rightarrow$ `mul` |
| Focal Loss (FOL) | $-\alpha(1 - p_t)^\gamma \ln(p_t)$ | Adds a $(1 - p_t)^\gamma$ factor to the Cross-Entropy loss | Imbalanced Data: Recommended if the `dataset` class counts are skewed, where the model needs to focus on harder examples rather than on the easier ones. | `pow`, `sub`, `log`, `mul` |
This section defines the binary format for models and their parameters.
This section describes the test harness used to validate the WebNNM library's implementation of analytic gradients. The approach taken is to measure the gradient for each parameter and compare it to the analytic value computed by the WebNNM library for each operator. This is done repeatedly for a set of randomly initialised tensors for the parameters and the input from the previous layer. In more detail, for each operator:
Note: the analytic gradients for the WebNN recurrent operators, e.g. LSTM and GRU, are computed by mapping them to the primitive operations and unrolling over time for a given sequence length.
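For illustration, a central finite-difference check of the kind described above might look as follows; the helper names are placeholders and the tolerance policy is an assumption.

```javascript
// Sketch of a central-difference gradient check; the helper names
// (runForward, params) are placeholders for whatever the test harness provides.
function numericGradient(runForward, params, index, epsilon = 1e-3) {
  const original = params[index];
  params[index] = original + epsilon;
  const lossPlus = runForward(params);        // forward pass with a nudged parameter
  params[index] = original - epsilon;
  const lossMinus = runForward(params);
  params[index] = original;                   // restore the original value
  return (lossPlus - lossMinus) / (2 * epsilon);
}

// Compare against the analytic value for randomly initialised tensors,
// e.g. require |numeric - analytic| / max(1, |analytic|) below a tolerance.
```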