Have we reached the era of self-supervised learning?
Data is flowing in every single day. People are working 24/7. Jobs are distributed to every corner of the world. But still, much data is left unannotated, waiting for possible use by a new model, a new training run, or a new upgrade.
Or, that may never happen. It will never happen as long as the world keeps working in a supervised fashion.
The rise of self-supervised learning in recent years has unveiled a new direction. Instead of creating annotations for all tasks, self-supervised learning breaks tasks into pretext/pre-training tasks (see my earlier post on pre-training here) and downstream tasks. The pretext tasks focus on extracting representative features from the whole dataset without the guidance of any ground-truth annotations. However, this still requires labels generated automatically from the dataset, usually through intensive data augmentation. Hence, we use the terms unsupervised learning (the dataset is unannotated) and self-supervised learning (the tasks are supervised by self-generated labels) interchangeably in this article.
Contrastive learning is a major class of self-supervised learning. It uses unlabelled datasets and contrastive information-encoding losses (e.g., the contrastive loss, InfoNCE loss, triplet loss, etc.) to train the deep learning network. Major contrastive learning frameworks include SimCLR, SimSiam, and the MoCo series.
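For concreteness, the InfoNCE loss that the MoCo series builds on can be written as below (in the notation of the MoCo paper), where q is an encoded query, k₊ is its matching key, the k_i are the K + 1 candidate keys in the dictionary, and τ is a temperature hyper-parameter:

```latex
\mathcal{L}_{q} = -\log
\frac{\exp\left(q \cdot k_{+} / \tau\right)}
     {\sum_{i=0}^{K} \exp\left(q \cdot k_{i} / \tau\right)}
```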
MoCo is an abbreviation for "momentum contrast." The core idea was laid out in the first MoCo paper, which frames the computer vision self-supervised learning problem as follows:
"[quote from original paper] Computer vision, in contrast, further concerns dictionary building, as the raw signal is in a continuous, high-dimensional space and is not structured for human communication… Though driven by various motivations, these (note: recent visual representation learning) methods can be thought of as building dynamic dictionaries… Unsupervised learning trains encoders to perform dictionary look-up: an encoded 'query' should be similar to its matching key and dissimilar to others. Learning is formulated as minimizing a contrastive loss."
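This dictionary look-up can be sketched in a few lines of PyTorch. Below is a minimal illustration, assuming `q` and `k_pos` are already L2-normalized encoder outputs and `queue` is a memory bank of negative keys; the names are mine, though the logic mirrors the pseudocode published with MoCo v1:

```python
import torch
import torch.nn.functional as F

def dictionary_lookup_loss(q, k_pos, queue, temperature=0.07):
    """InfoNCE as dictionary look-up: each query should match its own key
    and be dissimilar to every key stored in the queue.

    q:      (N, C) L2-normalized encoded queries
    k_pos:  (N, C) L2-normalized matching keys
    queue:  (C, K) memory bank of negative keys
    """
    # positive logits: one similarity score per query -> (N, 1)
    l_pos = torch.einsum("nc,nc->n", q, k_pos).unsqueeze(-1)
    # negative logits against the whole dictionary -> (N, K)
    l_neg = torch.einsum("nc,ck->nk", q, queue)
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    # the positive key sits at index 0 for every query
    labels = torch.zeros(logits.shape[0], dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)
```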
In this article, we will do a gentle review of MoCo v1 to v3:
- v1 — the paper "Momentum Contrast for Unsupervised Visual Representation Learning" was published in CVPR 2020. It proposes a momentum update of the key ResNet encoder and uses sample queues with the InfoNCE loss (see the sketch after this list).
- v2 — the paper "Improved Baselines with Momentum Contrastive Learning" came out shortly after, adopting two architecture improvements from SimCLR: a) replacing the FC projection layer with a 2-layer MLP and b) extending the original data augmentation with blur.
- v3 — the paper "An Empirical Study of Training Self-Supervised Vision Transformers" was published in ICCV 2021. The framework extends the single key-query pair to two key-query pairs, which are used to form a SimSiam-style symmetric contrastive loss. The backbone was also extended from ResNet-only to both ResNet and ViT.
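As promised above, here is a minimal sketch of the two pieces that make v1 "momentum" contrast: the EMA update of the key encoder and the queue maintenance. It assumes `encoder_q` and `encoder_k` are two networks with identical architecture and `queue` is a pre-allocated (C, K) tensor; the momentum value 0.999 and the divisibility assumption follow the paper's defaults, while the function names are mine:

```python
import torch

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    # EMA update: the key encoder slowly follows the query encoder
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)

@torch.no_grad()
def dequeue_and_enqueue(queue, queue_ptr, keys):
    # queue: (C, K) key dictionary; queue_ptr: 1-element long tensor;
    # keys: (N, C) freshly encoded keys. Assumes K is divisible by N.
    batch_size = keys.shape[0]
    ptr = int(queue_ptr)
    # overwrite the oldest keys with the newest batch
    queue[:, ptr:ptr + batch_size] = keys.T
    queue_ptr[0] = (ptr + batch_size) % queue.shape[1]
```

In the paper's training loop, the key encoder is refreshed with `momentum_update` after each optimizer step on the query encoder, and the freshly encoded keys are enqueued while the oldest ones are dropped, so the dictionary always holds the most recent keys.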