In the past few years of working at NEC Labs, I have released several OSS packages on GitHub to ease the deployment of a retail video surveillance POC. The design and organization follow the Dependency Inversion Principle (DIP) instead of depending directly on a particular deep learning framework such as PyTorch. A target ML application is supposed to invoke the APIs provided by the packages. While the best documentation is the code itself, several feature highlights are worth going through. Hopefully, one may find them useful for production ML applications.

ML

So far, the only backend is PyTorch, and ML essentially mimics its APIs. Nonetheless, it is possible to replace it with alternatives such as TensorFlow. This also helps in case the latest release of PyTorch introduces incompatibilities or even bugs: ML can quickly work around such issues before the official fix, such that the target ML application remains intact.
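
In practice, the indirection looks like the sketch below. The ml.nn namespace here is hypothetical and simply stands for ML mirroring the corresponding PyTorch APIs:

# Hypothetical usage: the app depends only on ML's namespaces, which
# currently delegate to PyTorch but could delegate to another backend.
from ml import nn

model = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
)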

Flexible Configurations

The configuration APIs follow the YAML format with several enhancements:

  • Accept scientific notation without a decimal point, which a YAML 1.1 parser would otherwise load as a string (see the sketch after this list)
  • Support a custom YAML constructor !include for ease of hierarchical configuration management
    • The imported YAML config is allowed to update its parent YAML config nodes
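
As a sketch of the first enhancement, an implicit resolver can be registered with PyYAML so that a literal like 1e-3 loads as a float rather than a string. This is illustrative, not the package's actual code:

import re
import yaml

class MLLoader(yaml.SafeLoader):
    pass

# YAML 1.1 requires a decimal point in floats, so "1e-3" loads as a string.
# A broader implicit resolver makes such literals load as floats instead.
MLLoader.add_implicit_resolver(
    'tag:yaml.org,2002:float',
    re.compile(r'^[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?$'),
    list('-+0123456789.'))

print(yaml.load('lr: 1e-3', Loader=MLLoader))  # {'lr': 0.001}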

Here is a sample app project with configuration files under configs/:

|____app
| |____configs
| | |____defaults.yml
| | |____sites
| | | |____site2.yml
| | | |____site1.yml
| | |____app.yml
| |____src
| | |____program.py

A minimal sample program.py is as follows, wrapping its main(...) with the ML app launcher:

#!/usr/bin/env python

def main(cfg):
    # The launcher invokes main(...) with the fully merged configuration.
    print(f"Running main() with cfg={cfg}")

if __name__ == '__main__':
    from ml import app
    from ml.argparse import ArgumentParser

    # Parse --cfg and any key=value overrides into a single config object.
    parser = ArgumentParser()
    cfg = parser.parse_args()
    # Hand off to the launcher, which also handles daemon mode, GPU
    # visibility, and distributed execution.
    app.run(main, cfg)

Sample configuration files are placed under configs/. The top-level one is configs/app.yml, which the program takes via the --cfg option:

%> src/program.py --cfg configs/app.yml

Looking into configs/app.yml below, the custom !include constructor pulls in a specific configuration file whose contents update the parent configuration. A sketch of how such a constructor can be implemented follows the file.

import:
  defaults: 
    app_site: !include defaults.yml
  app_site: !include sites/site1.yml

app_platform: Apple M1
app_OS: MacOSX
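
For reference, a constructor like !include can be registered with PyYAML along the following lines. This is a minimal sketch with illustrative names, not the package's actual implementation; merging the included content into the parent node is covered further below.

import os
import yaml

class IncludeLoader(yaml.SafeLoader):
    """SafeLoader that resolves !include paths relative to the current file."""
    def __init__(self, stream):
        self._root = os.path.dirname(getattr(stream, 'name', '.'))
        super().__init__(stream)

def _include(loader, node):
    # Load the referenced YAML file; merging into the parent happens elsewhere.
    path = os.path.join(loader._root, loader.construct_scalar(node))
    with open(path) as f:
        return yaml.load(f, IncludeLoader)

IncludeLoader.add_constructor('!include', _include)

with open('configs/app.yml') as f:
    cfg = yaml.load(f, IncludeLoader)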

configs/defaults.yml serves as the default site config as follows:

  name: Little Planet Branch
  location: St. Louis, MO
  template: XXX

configs/sites/site1.yml is the configuration for a specific site that may overwrite the defaults:

  name: site1
  location: Santa Clara, CA
  note: Welcome to CA

Since app.yml includes site1.yml, site1.yml updates the configuration in defaults.yml much like updating a Python dictionary. Therefore, the resulting configuration is as follows:

%> src/program.py --cfg configs/app.yml
Running main() with cfg={   '__file__': PosixPath('configs/app.yml'),
    'app_OS': 'MacOSX',
    'app_platform': 'Apple M1',
    'app_site': {   'location': 'Santa Clara, CA',
                    'name': 'site1',
                    'note': 'Welcome to CA',
                    'template': 'XXX'},
    'daemon': False,
    'deterministic': False,
    'dist': None,
    'dist_backend': 'nccl',
    'dist_no_sync_bn': False,
    'dist_port': 25900,
    'dist_url': 'env://',
    'gpu': [],
    'logfile': None,
    'logging': False,
    'no_gpu': False,
    'rank': -1,
    'seed': 1204,
    'slurm_constraint': None,
    'slurm_cpus_per_task': 4,
    'slurm_export': '',
    'slurm_mem': '32G',
    'slurm_nodelist': None,
    'slurm_nodes': 1,
    'slurm_ntasks_per_node': 1,
    'slurm_partition': 'gpu',
    'slurm_time': '0',
    'world_size': -1}

Note that only the keys name and location are replaced, while template remains as is. The semantics for an additional key such as note is to merge it into the parent configuration; a sketch of this merge logic follows the next example. Other key-value settings are supplied by ml.argparse as default command line options for other features. Rather than adding too many command line options, one or more key-value settings in the configuration can be overwritten with the syntax key=value following --cfg /path/to/config.yml on the command line. For example, to replace the value of app_site.location with San Jose, CA, the following command line suffices:

%> src/program.py --cfg configs/app.yml app_site.location='San Jose, CA'
Running main() with cfg={   '__file__': PosixPath('configs/app.yml'),
    'app_OS': 'MacOSX',
    'app_platform': 'Apple M1',
    'app_site': {   'location': 'San Jose, CA',
                    'name': 'site1',
                    'note': 'Welcome to CA',
                    'template': 'XXX'},
    ...
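
The merge semantics behind both !include and the command line overrides amount to a recursive dictionary update, sketched below for illustration (not the package's actual code):

def merge(parent, child):
    # Recursively update parent with child: existing keys are replaced,
    # new keys are added, and nested dictionaries are merged in place.
    for key, value in child.items():
        if isinstance(value, dict) and isinstance(parent.get(key), dict):
            merge(parent[key], value)
        else:
            parent[key] = value
    return parent

defaults = {'name': 'Little Planet Branch', 'template': 'XXX'}
site1 = {'name': 'site1', 'note': 'Welcome to CA'}
print(merge(defaults, site1))
# {'name': 'site1', 'template': 'XXX', 'note': 'Welcome to CA'}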

HDF5 Compression

When saving Python objects, pickle, h5, and PyTorch binaries are supported depending on the file suffix. If h5 is chosen to save sparse binary features, compression options such as zstd can be enabled to reduce storage significantly.
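
For instance, writing sparse binary features with zstd compression could look like the following sketch, assuming h5py together with the hdf5plugin package that registers the zstd filter:

import h5py
import hdf5plugin  # registers zstd and other HDF5 compression filters
import numpy as np

# Mostly-zero binary features compress extremely well with zstd.
features = (np.random.rand(10000, 512) > 0.99).astype(np.float32)

with h5py.File('features.h5', 'w') as f:
    f.create_dataset('features', data=features, chunks=True,
                     **hdf5plugin.Zstd(clevel=9))

with h5py.File('features.h5', 'r') as f:
    assert (f['features'][:] == features).all()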

GPU Visibility

Deep learning programs tend to access GPUs in parallel. Managing GPU visibility is crucial to facilitate distributed training and other processing that requires GPU access. ML provides simple app launcher APIs to support common GPU access options.
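
Under the hood, GPU visibility is typically controlled through the CUDA_VISIBLE_DEVICES environment variable. The sketch below illustrates what options like --gpu and --no_gpu in the configuration dump above could do; it is not the launcher's actual code:

import os

def set_gpu_visibility(gpus=None, no_gpu=False):
    # Restrict which GPUs CUDA exposes to this process and its children.
    # Must be set before CUDA is initialized (e.g., before touching torch.cuda).
    if no_gpu:
        os.environ['CUDA_VISIBLE_DEVICES'] = ''  # hide all GPUs
    elif gpus:
        os.environ['CUDA_VISIBLE_DEVICES'] = ','.join(map(str, gpus))

set_gpu_visibility(gpus=[0, 2])  # only devices 0 and 2 are visible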

Daemon Mode

It is common to kick off a long-running training process through a remote shell. Instead of figuring out how screen and nohup work, turning a process into a daemon detached from the terminal is as simple as providing a command line option. The ML app launcher APIs deal with the underlying complexities, so running the program as a daemon feels the same as running it in the foreground.
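
Behind such a daemon option usually sits the classic Unix double-fork. A minimal sketch, not the launcher's actual code:

import os
import sys

def daemonize():
    # Double-fork so the process detaches from the controlling terminal
    # and survives the remote shell logging out.
    if os.fork() > 0:
        sys.exit(0)   # first parent exits
    os.setsid()       # become session leader, drop the controlling terminal
    if os.fork() > 0:
        sys.exit(0)   # second parent exits; the grandchild is re-parented
    sys.stdout.flush()
    sys.stderr.flush()
    # Redirect the standard streams to /dev/null.
    with open(os.devnull, 'rb') as null_in, open(os.devnull, 'ab') as null_out:
        os.dup2(null_in.fileno(), sys.stdin.fileno())
        os.dup2(null_out.fileno(), sys.stdout.fileno())
        os.dup2(null_out.fileno(), sys.stderr.fileno())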

Distributed Training and Execution

Composing a distributed parallel program in Python can be daunting and error prone. The ML app launcher APIs support PyTorch and SLURM backends for training and general execution through command line options. The PyTorch backend assumes one GPU per worker process on a single GPU node. The SLURM backend supports a cluster environment and execution across multiple GPU nodes.
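
What the launcher automates for the PyTorch backend roughly corresponds to the following sketch: one worker process per GPU, each joining an NCCL process group. The port matches the dist_port default shown earlier; the rest is illustrative:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # One process per GPU on a single node.
    dist.init_process_group('nccl', init_method='env://',
                            rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    # ... build the model, wrap it in DistributedDataParallel, train ...
    dist.destroy_process_group()

if __name__ == '__main__':
    os.environ.setdefault('MASTER_ADDR', '127.0.0.1')
    os.environ.setdefault('MASTER_PORT', '25900')  # cf. dist_port above
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)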

TensorRT Deployment

For production inference to be competitive, ML model deployment optimization is necessary to reduce the runtime cost. TensorRT is a popular backend for ML to support. The APIs make it straightforward to convert a pretrained model into its TensorRT counterpart.
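
A conversion along these lines is sketched below with the torch2trt package; the model name and input shape are illustrative, and ML's own conversion APIs may differ:

import torch
from torch2trt import torch2trt

# Convert a pretrained model given an example input of the deployment shape.
model = torch.hub.load('pytorch/vision', 'resnet18', pretrained=True)
model = model.eval().cuda()
x = torch.randn(1, 3, 224, 224).cuda()
model_trt = torch2trt(model, [x], fp16_mode=True)  # FP16 for lower latency

# The converted module is a drop-in replacement at inference time.
with torch.no_grad():
    y = model_trt(x)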

Checkpoints from AWS/S3 and Google Drive

By default, the PyTorch hub APIs only support loading checkpoints from direct URLs. The ML hub APIs further support AWS/S3 and Google Drive for private or third-party checkpoint storage. This makes business deployment easy at a low cost.
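
Loading a private checkpoint from S3 boils down to something like the sketch below with boto3; the bucket and key are placeholders, and ML's hub APIs wrap this behind a hub-style interface:

import boto3
import torch

def load_checkpoint_from_s3(bucket, key, map_location='cpu'):
    # Download the checkpoint object from S3 to a local path, then
    # load it with torch; caching is omitted for brevity.
    path = f"/tmp/{key.rsplit('/', 1)[-1]}"
    boto3.client('s3').download_file(bucket, key, path)
    return torch.load(path, map_location=map_location)

state_dict = load_checkpoint_from_s3('my-bucket', 'models/detector.pth')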

ML-Vision

TBD

ML-WS

TBD

feedstocks

TBD