# Algorithms implemented

Most methods aim to learn a common representation space for the source and target domains, splitting the classical end-to-end deep neural network into a feature extractor with parameters \(\Phi\) and a task classifier with parameters \(\theta_y\). Alignment between the source and target feature distributions is obtained by adding an alignment term \(L_d\) to the usual task loss \(L_c\):

\(L = L_c + \lambda \cdot L_d\)

The alignment term is weighted by a parameter \(\lambda\) which grows from 0 to 1 during training. Some algorithms use a third network with parameters \(\theta_d\) to parameterize the alignment term \(L_d\).
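
As an illustration, one common ramp-up for \(\lambda\) is the schedule from the DANN paper; the sketch below combines it with the weighted loss. The function names and the constant \(\gamma = 10\) are illustrative, not ADA's exact code.

```python
import math

def lambda_schedule(progress, gamma=10.0):
    """Ramp lambda from 0 to 1 as training progress p goes from 0 to 1,
    using the schedule from the DANN paper: 2 / (1 + exp(-gamma * p)) - 1."""
    return 2.0 / (1.0 + math.exp(-gamma * progress)) - 1.0

def total_loss(task_loss, align_loss, progress):
    """L = L_c + lambda * L_d, with lambda following the ramp-up schedule."""
    return task_loss + lambda_schedule(progress) * align_loss
```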

A typical algorithm may thus be represented with 2 or 3 blocks (feature extractor, task classifier, and optionally a third network for the alignment term), as in the figure below:

Three types of alignment terms are implemented in ADA, leading to 3 families of methods:

- Adversarial methods, similar to DANN, use a so-called domain classifier with parameters \(\theta_d\) as an adversary to align the features,
- Optimal-transport based methods, in which the domain classifier, called a critic, is trained to estimate the divergence between the source and target feature distributions, which the feature extractor then minimizes,
- Kernel-based methods, which align features by minimizing the maximum mean discrepancy (MMD) in a reproducing kernel Hilbert space.

## DANN-like methods

What these methods have in common is that they all use a gradient-reversal layer (GRL), as described in the DANN paper.
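
For illustration, here is a minimal PyTorch sketch of a gradient-reversal layer: identity in the forward pass, negated and scaled gradient in the backward pass. This is a generic version, not necessarily ADA's exact implementation.

```python
from torch.autograd import Function

class GradReverse(Function):
    """Identity in the forward pass; multiplies the gradient by -lambda in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back into the feature extractor.
        return grad_output.neg() * ctx.lambd, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)
```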

- DANN: Ganin, Yaroslav, et al. “Domain-adversarial training of neural networks.” The Journal of Machine Learning Research (2016). https://arxiv.org/abs/1505.07818
- CDAN: Long, Mingsheng, et al. “Conditional adversarial domain adaptation.” Advances in Neural Information Processing Systems. 2018. https://papers.nips.cc/paper/7436-conditional-adversarial-domain-adaptation.pdf
  - and its variant CDAN-E (with entropy weighting).
- FSDANN: a naive adaptation of DANN to the few-shot setting (using the known target labels in the task loss).
- MME: Saito, Kuniaki, et al. “Semi-supervised domain adaptation via minimax entropy.” Proceedings of the IEEE International Conference on Computer Vision. 2019. https://arxiv.org/pdf/1904.06487.pdf
  - This method applies the GRL to the entropy of the task classifier output for target samples; a sketch of that entropy term follows the list.
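
A minimal sketch of that entropy term, assuming plain softmax outputs; the adversarial min/max over it would be handled by passing the features through the GRL shown above.

```python
import torch
import torch.nn.functional as F

def target_entropy(target_logits):
    """Mean entropy of the task classifier's softmax output on unlabeled target samples."""
    probs = F.softmax(target_logits, dim=1)
    return -(probs * torch.log(probs + 1e-8)).sum(dim=1).mean()
```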

## Optimal transport methods

Currently WDGRL is implemented, as described in Shen, Jian, et al. “Wasserstein distance guided representation learning for domain adaptation.” Thirty-Second AAAI Conference on Artificial Intelligence. 2018. https://arxiv.org/pdf/1707.01217.pdf

Its variant, WDGRLMod, better fits PyTorch Lightning's training patterns. The difference is that the critic is optimized on `k_critic` different batches instead of `k_critic` times on the same batch.

When the `beta_ratio` parameter is non-zero, both methods also implement the asymmetric (\(\beta\)) relaxation described in Wu, Yifan, et al. “Domain adaptation with asymmetrically-relaxed distribution alignment.” ICML (2019). https://arxiv.org/pdf/1903.01689.pdf
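
As an illustration, a generic sketch of the critic objective used by this family of methods (not ADA's exact code): the critic maximizes a dual estimate of the Wasserstein distance, subject to a gradient penalty enforcing the Lipschitz constraint, while the feature extractor minimizes it.

```python
import torch

def wasserstein_estimate(critic, h_src, h_tgt):
    """Dual estimate of the Wasserstein distance between source and target features.
    The critic is trained to maximize it; the feature extractor to minimize it."""
    return critic(h_src).mean() - critic(h_tgt).mean()

def gradient_penalty(critic, h_src, h_tgt):
    """Push the critic's gradient norm towards 1 on features interpolated between domains."""
    alpha = torch.rand(h_src.size(0), 1, device=h_src.device)
    interp = (alpha * h_src + (1 - alpha) * h_tgt).requires_grad_(True)
    grads = torch.autograd.grad(critic(interp).sum(), interp, create_graph=True)[0]
    return ((grads.norm(2, dim=1) - 1) ** 2).mean()
```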

## MMD-based methods

- DAN: Long, Mingsheng, et al. “Learning Transferable Features with Deep Adaptation Networks.” International Conference on Machine Learning. 2015. http://proceedings.mlr.press/v37/long15.pdf
- JAN: Long, Mingsheng, et al. “Deep transfer learning with joint adaptation networks.” International Conference on Machine Learning. 2017. https://arxiv.org/pdf/1605.06636.pdf

Both methods have been implemented based on the authors' code at https://github.com/thuml/Xlearn.
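
For reference, here is a minimal multi-kernel Gaussian MMD estimate. It is a generic sketch: the bandwidths and the biased estimator are illustrative, not the exact kernel choices of DAN or JAN.

```python
import torch

def gaussian_mmd(h_src, h_tgt, sigmas=(1.0, 2.0, 4.0)):
    """Biased estimate of MMD^2 between source and target features,
    using a sum of Gaussian kernels with the given (illustrative) bandwidths."""
    def kernel(a, b):
        d2 = torch.cdist(a, b) ** 2
        return sum(torch.exp(-d2 / (2 * s ** 2)) for s in sigmas)
    return kernel(h_src, h_src).mean() + kernel(h_tgt, h_tgt).mean() - 2 * kernel(h_src, h_tgt).mean()
```

JAN additionally applies the discrepancy jointly over the activations of several layers (features and classifier outputs) rather than over a single feature layer.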

## Implementation

The classes for the unsupervised domain adaptation architectures are organised as in the diagram below, where each arrow denotes inheritance:

Most methods may be implemented by just writing the forward pass and the `compute_loss` method, which should return the two components \(L_c\) and \(L_d\), as well as the metrics to use for logging and evaluation.
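
As a rough illustration, a new method could look like the hypothetical sketch below. The class, attribute, and argument names are illustrative assumptions, not ADA's actual API; in the library, the method would subclass the appropriate base trainer class from the hierarchy above.

```python
import torch.nn.functional as F

# Hypothetical sketch: in ADA this would subclass one of the base trainer classes;
# feature_extractor, task_classifier and alignment_term are illustrative names.
class MyAdaptationMethod:
    def __init__(self, feature_extractor, task_classifier, alignment_term):
        self.feature_extractor = feature_extractor  # parameters phi
        self.task_classifier = task_classifier      # parameters theta_y
        self.alignment_term = alignment_term        # method-specific L_d

    def forward(self, x):
        features = self.feature_extractor(x)
        logits = self.task_classifier(features)
        return features, logits

    def compute_loss(self, batch):
        # Hypothetical batch layout: labelled source samples, unlabelled target samples.
        (x_source, y_source), x_target = batch
        feat_s, logits_s = self.forward(x_source)
        feat_t, _ = self.forward(x_target)

        task_loss = F.cross_entropy(logits_s, y_source)    # L_c
        align_loss = self.alignment_term(feat_s, feat_t)   # L_d
        metrics = {"source_acc": (logits_s.argmax(1) == y_source).float().mean()}
        return task_loss, align_loss, metrics
```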