Publications
2024
- Sketchy Moment Matching: Toward Fast and Provable Data Selection for Finetuning. Yijun Dong*, Hoang Phan*, Xiang Pan*, and Qi Lei. Advances in Neural Information Processing Systems, 2024.
We revisit data selection in a modern context of finetuning from a fundamental perspective. Extending the classical wisdom of variance minimization in low dimensions to high-dimensional finetuning, our generalization analysis unveils the importance of additionally reducing bias induced by low-rank approximation. Inspired by the variance-bias tradeoff in high dimensions from the theory, we introduce Sketchy Moment Matching (SkMM), a scalable data selection scheme with two stages. (i) First, the bias is controlled using gradient sketching that explores the finetuning parameter space for an informative low-dimensional subspace S; (ii) then the variance is reduced over S via moment matching between the original and selected datasets. Theoretically, we show that gradient sketching is fast and provably accurate: selecting n samples by reducing variance over S preserves the fast-rate generalization O(dim(S)/n), independent of the parameter dimension. Empirically, we concretize the variance-bias balance via synthetic experiments and demonstrate the effectiveness of SkMM for finetuning in real vision tasks.
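For intuition, here is a heavily simplified numpy sketch of the two stages under assumed inputs (an N x p matrix G of per-sample finetuning gradients), with a naive greedy second-moment match standing in for the paper's selection rule rather than reproducing it:

```python
# Illustrative only: random-projection sketching followed by greedy
# second-moment matching in the sketched space (not the SkMM implementation).
import numpy as np

def select_coreset(G, n, sketch_dim=64, seed=0):
    rng = np.random.default_rng(seed)
    N, p = G.shape

    # Stage (i): gradient sketching -- project gradients onto a random
    # low-dimensional subspace to explore the parameter space cheaply.
    S = rng.normal(size=(p, sketch_dim)) / np.sqrt(sketch_dim)
    Z = G @ S                                    # N x sketch_dim sketched gradients

    # Stage (ii): moment matching -- greedily pick samples whose second
    # moment in the sketched space stays close to that of the full dataset.
    target = Z.T @ Z / N                         # full-data second moment
    selected, current = [], np.zeros_like(target)
    for _ in range(n):
        best, best_err = None, np.inf
        for i in range(N):
            if i in selected:
                continue
            cand = (current * len(selected) + np.outer(Z[i], Z[i])) / (len(selected) + 1)
            err = np.linalg.norm(cand - target)
            if err < best_err:
                best, best_err = i, err
        selected.append(best)
        current = (current * (len(selected) - 1) + np.outer(Z[best], Z[best])) / len(selected)
    return selected

# Example: pick 10 of 200 synthetic "gradients" living in 512 dimensions.
idx = select_coreset(np.random.randn(200, 512), n=10)
```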
- Enhancing Domain Adaptation through Prompt Gradient Alignment. Hoang Phan*, Lam Tran*, Quyen Tran*, and Trung Le. Advances in Neural Information Processing Systems, 2024.
Prior Unsupervised Domain Adaptation (UDA) methods often aim to train a domain-invariant feature extractor, which may hinder the model from learning sufficiently discriminative features. To tackle this, a line of works based on prompt learning leverages the power of large-scale pre-trained vision-language models to learn both domain-invariant and domain-specific features through a set of domain-agnostic and domain-specific learnable prompts. Those studies typically enforce invariant constraints on the representation, output, or prompt space to learn such prompts. In contrast, we cast UDA as a multiple-objective optimization problem in which each objective is represented by a domain loss. Under this new framework, we propose aligning per-objective gradients to foster consensus between them. Additionally, to prevent potential overfitting when fine-tuning this deep learning architecture, we penalize the norm of these gradients. To achieve these goals, we devise a practical gradient update procedure that works under both single-source and multi-source UDA. Empirically, our method consistently surpasses other prompt-based baselines by a large margin on different UDA benchmarks.
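As a rough illustration of the align-then-penalize idea (not the paper's actual procedure: the toy linear "prompt", the pseudo-label target objective, and the trade-off weights lam/mu below are assumptions), a single PyTorch update could look like:

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(16, 4)            # stand-in for the learnable prompt parameters
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

def flat_grad(loss):
    # Per-objective gradient as one flattened vector (graph kept for double backprop).
    grads = torch.autograd.grad(loss, model.parameters(), create_graph=True)
    return torch.cat([g.reshape(-1) for g in grads])

x_src, y_src = torch.randn(8, 16), torch.randint(0, 4, (8,))
x_tgt = torch.randn(8, 16)                # unlabeled target-domain batch

loss_src = F.cross_entropy(model(x_src), y_src)
y_pseudo = model(x_tgt).argmax(-1).detach()            # assumed target-domain objective
loss_tgt = F.cross_entropy(model(x_tgt), y_pseudo)

g_src, g_tgt = flat_grad(loss_src), flat_grad(loss_tgt)
alignment = F.cosine_similarity(g_src, g_tgt, dim=0)   # reward gradient consensus
norm_penalty = g_src.norm() + g_tgt.norm()             # discourage overly large gradients

lam, mu = 1.0, 0.1                                     # assumed trade-off weights
total = loss_src + loss_tgt - lam * alignment + mu * norm_penalty
opt.zero_grad()
total.backward()
opt.step()
```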
- DiMSUM: Diffusion Mamba - A Scalable and Unified Spatial-Frequency Method for Image Generation. Hao Phung*, Quan Dao*, Trung Dao, Hoang Phan, Dimitris N. Metaxas, and Anh Tran. Advances in Neural Information Processing Systems, 2024.
Our approach presents a novel state-space architecture for diffusion models, effectively harnessing spatial and frequency information to enhance the inductive bias towards local features in input images for image generation tasks. While state-space networks, including Mamba, a revolutionary advancement in recurrent neural networks, typically scan input sequences from left to right, they face difficulties with manually defined scanning orders, especially when processing visual data. Our method demonstrates that integrating wavelet transformation into Mamba enhances the local structure awareness of visual inputs by disentangling them into wavelet subbands, representing both low- and high-frequency components. These wavelet-based outputs are then processed and seamlessly fused with the original Mamba outputs through a cross-attention fusion layer, combining both spatial and frequency information to optimize the order awareness of state-space models, which is essential for the details and overall quality of image generation. In addition, we introduce a globally shared transformer to supercharge the performance of Mamba, harnessing its exceptional power to capture global relationships. Through extensive experiments on standard benchmarks, our method demonstrates state-of-the-art results, achieving faster training convergence and delivering high-quality outputs.
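The wavelet disentanglement mentioned above can be pictured with a minimal single-level Haar decomposition (illustrative only; the model's actual transform, Mamba blocks, and cross-attention fusion are not reproduced here):

```python
# Split a single-channel image into one low-frequency (LL) and three
# high-frequency (LH, HL, HH) subbands via an unnormalized Haar transform.
import numpy as np

def haar_dwt2(x):
    """x: (H, W) array with even H and W. Returns LL, LH, HL, HH subbands."""
    a = (x[0::2, :] + x[1::2, :]) / 2.0   # row averages
    d = (x[0::2, :] - x[1::2, :]) / 2.0   # row differences
    LL = (a[:, 0::2] + a[:, 1::2]) / 2.0
    LH = (a[:, 0::2] - a[:, 1::2]) / 2.0
    HL = (d[:, 0::2] + d[:, 1::2]) / 2.0
    HH = (d[:, 0::2] - d[:, 1::2]) / 2.0
    return LL, LH, HL, HH

LL, LH, HL, HH = haar_dwt2(np.random.rand(32, 32))
```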
- Randomly Pivoted V-optimal Design: Fast Data Selection under Low Intrinsic Dimension. Yijun Dong*, Xiang Pan*, Hoang Phan*, and Qi Lei. Machine Learning and Compression Workshop, 2024.
Despite the ubiquitous high dimensionality brought about by the increasing sizes of models and data, low intrinsic dimensions are commonly found in many high-dimensional learning problems (e.g., finetuning). To explore sample-efficient learning that leverages such low intrinsic dimensions, we introduce randomly pivoted V-optimal design (RPVopt), a fast data selection algorithm that combines dimension reduction via sketching and optimal experimental design. Given a large dataset with N samples in a high dimension d, RPVopt first reduces the dimensionality from d to m ≪ d by embedding the data into a random low-dimensional subspace via sketching. Then a coreset of size n > m is selected based on the low-dimensional sketched data through an efficient two-stage random pivoting algorithm. With a fast embedding matrix for sketching, RPVopt achieves an asymptotic complexity of O(Nd + Nnm), linear in the full data size, data dimension, and coreset size. With extensive experiments in both regression and classification settings, we demonstrate the empirical effectiveness of RPVopt in data selection for finetuning vision tasks.
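As a rough illustration of the sketch-then-pivot idea (a simplification: only a single residual-based pivoting stage is shown, and the sketch size and sampling probabilities below are assumptions rather than the paper's two-stage procedure for n > m):

```python
# Illustrative only: Gaussian sketching followed by residual-norm random pivoting.
import numpy as np

def rp_select(X, n, m=32, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape

    # Step 1: sketch -- embed the data into a random m-dimensional subspace (m << d).
    Z = X @ rng.normal(size=(d, m)) / np.sqrt(m)

    # Step 2: random pivoting -- repeatedly sample a point with probability
    # proportional to its squared residual norm, then deflate that direction.
    R = Z.copy()
    selected = []
    for _ in range(n):
        scores = (R ** 2).sum(axis=1)
        i = rng.choice(N, p=scores / scores.sum())
        selected.append(i)
        v = R[i] / (np.linalg.norm(R[i]) + 1e-12)
        R = R - np.outer(R @ v, v)          # remove the chosen direction from all residuals
    return selected

# Example: pick 20 of 1000 points originally in 512 dimensions (n <= m here,
# since the second stage that handles n > m is not reproduced).
idx = rp_select(np.random.randn(1000, 512), n=20)
```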
- Controllable Prompt Tuning For Balancing Group Distributional Robustness. Hoang Phan, Andrew Gordon Wilson, and Qi Lei. Proceedings of the 41st International Conference on Machine Learning, 2024.
Models trained on data composed of different groups or domains can suffer from severe performance degradation under distribution shifts. While recent methods have largely focused on optimizing the worst-group objective, this often comes at the expense of good performance on other groups. To address this problem, we introduce an optimization scheme that seeks good performance across all groups and finds a good solution for each without severely sacrificing performance on any of them. However, directly applying such optimization involves updating the parameters of the entire network, making it both computationally expensive and challenging. Thus, we introduce Controllable Prompt Tuning (CPT), which couples our approach with prompt-tuning techniques. On spurious correlation benchmarks, our procedures achieve state-of-the-art results across both transformer and non-transformer architectures, as well as unimodal and multimodal data, while requiring only 0.4% tunable parameters.
2023
- Robust Contrastive Learning With Theory Guarantee. Ngoc N Tran, Lam Tran, Hoang Phan, Anh Bui, Tung Pham, Tran Toan, Dinh Phung, and Trung Le. 2023.
Contrastive learning (CL) is a semi-supervised training paradigm that allows us to extract meaningful features without any label information. A typical CL method is divided into two phases, where it first tries to learn the features from unlabeled data, and then uses those features to train a linear classifier with the labeled data. While a fair amount of existing theoretical works have analyzed how the unsupervised loss in the first phase can support the supervised loss in the second phase, none has examined the connection between the unsupervised loss and the robust supervised loss, which can shed light on how to construct an effective unsupervised loss for the first phase of CL. To fill this gap, our work develops rigorous theories to dissect and identify which components in the unsupervised loss can help improve the robust supervised loss, and conducts proper experiments to verify our findings. All code used in this work is available at https://anonymous.4open.science/r/rosa.
- Flat Seeking Bayesian Neural Networks. Anh Nguyen, Long Vuong, Hoang Phan, Toan Do, Dinh Phung, and Trung Le. Advances in Neural Information Processing Systems, 2023.
Bayesian Neural Networks (BNNs) offer a probabilistic interpretation for deep learning models by imposing a prior distribution over model parameters and inferring a posterior distribution based on observed data. Models sampled from the posterior distribution can be used to provide ensemble predictions and quantify prediction uncertainty. It is well known that deep learning models with lower sharpness have better generalization ability. Nonetheless, existing posterior inferences are not aware of sharpness/flatness, hence possibly leading to high sharpness for the models sampled from them. In this paper, we develop the theories, the Bayesian setting, and the variational inference approach for the sharpness-aware posterior. Specifically, the models sampled from our sharpness-aware posterior and the optimal approximate posterior estimating this sharpness-aware posterior are flatter, hence possibly possessing higher generalization ability. We conduct experiments by leveraging the sharpness-aware posterior with state-of-the-art Bayesian Neural Networks, showing that the flat-seeking counterparts outperform their baselines in all metrics of interest.
- Sharpness & Shift-Aware Self-Supervised Learning. Ngoc N Tran, Son Duong, Hoang Phan, Tung Pham, Dinh Phung, and Trung Le. arXiv preprint arXiv:2305.10252, 2023.
Self-supervised learning aims to extract meaningful features from unlabeled data for further downstream tasks. In this paper, we consider classification as the downstream task in phase 2 and develop rigorous theories to identify the factors that implicitly influence the general loss of this classification task. Our theories signify that sharpness-aware feature extractors benefit the classification task in phase 2, and that the data shift between the ideal distribution (i.e., the one used in theory development) and the practical distribution (i.e., the one used in implementation) for generating positive pairs also remarkably affects this classification task. Building on these theoretical findings, we propose to minimize the sharpness of the feature extractor and introduce a new Fourier-based data augmentation technique to relieve the data shift in the distributions generating positive pairs, arriving at Sharpness & Shift-Aware Contrastive Learning (SSA-CLR). We conduct extensive experiments to verify our theoretical findings and demonstrate that sharpness & shift-aware contrastive learning can remarkably boost performance as well as obtain more robust extracted features compared with the baselines.
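The flavour of a Fourier-based augmentation can be conveyed with a generic amplitude-mixing sketch in numpy (the paper's exact transform may differ; beta and alpha are assumed hyperparameters):

```python
# Mix the low-frequency amplitude spectra of two images while keeping the
# phase of the first image (illustrative of the general idea only).
import numpy as np

def fourier_amplitude_mix(img_a, img_b, beta=0.1, alpha=0.5):
    """img_a, img_b: float arrays of shape (H, W, C) in [0, 1]."""
    fa = np.fft.fft2(img_a, axes=(0, 1))
    fb = np.fft.fft2(img_b, axes=(0, 1))
    amp_a, pha_a = np.abs(fa), np.angle(fa)
    amp_b = np.abs(fb)

    # Blend amplitudes inside a small centered low-frequency box.
    amp_a = np.fft.fftshift(amp_a, axes=(0, 1))
    amp_b = np.fft.fftshift(amp_b, axes=(0, 1))
    H, W = img_a.shape[:2]
    h, w = int(H * beta), int(W * beta)
    cy, cx = H // 2, W // 2
    amp_a[cy - h:cy + h, cx - w:cx + w] = (
        (1 - alpha) * amp_a[cy - h:cy + h, cx - w:cx + w]
        + alpha * amp_b[cy - h:cy + h, cx - w:cx + w]
    )
    amp_a = np.fft.ifftshift(amp_a, axes=(0, 1))

    mixed = np.fft.ifft2(amp_a * np.exp(1j * pha_a), axes=(0, 1)).real
    return np.clip(mixed, 0.0, 1.0)

aug = fourier_amplitude_mix(np.random.rand(32, 32, 3), np.random.rand(32, 32, 3))
```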
- Global-Local Regularization Via Distributional Robustness. Hoang Phan, Trung Le, Trung Phung, Tuan Anh Bui, Nhat Ho, and Dinh Phung. Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, 2023.
Despite superior performance in many situations, deep neural networks are often vulnerable to adversarial examples and distribution shifts, limiting model generalization ability in real-world applications. To alleviate these problems, recent approaches leverage distributional robustness optimization (DRO) to find the most challenging distribution, and then minimize the loss function over this most challenging distribution. Despite achieving some improvements, these DRO approaches have obvious limitations. First, they purely focus on local regularization to strengthen model robustness, missing a global regularization effect that is useful in many real-world applications (e.g., domain adaptation, domain generalization, and adversarial machine learning). Second, the loss functions in the existing DRO approaches operate only on the most challenging distribution, hence decoupling from the original distribution and leading to a restrictive modeling capability. In this paper, we propose a novel regularization technique, following the vein of the Wasserstein-based DRO framework. Specifically, we define a particular joint distribution and Wasserstein-based uncertainty, allowing us to couple the original and most challenging distributions for enhancing modeling capability and applying both local and global regularizations. Empirical studies on different learning problems demonstrate that our proposed approach significantly outperforms the existing regularization approaches in various domains: semi-supervised learning, domain adaptation, domain generalization, and adversarial machine learning.
2022
- Improving Multi-task Learning via Seeking Task-based Flat Regions. Hoang Phan, Lam Tran, Ngoc Tran, Nhat Ho, Dinh Phung, and Trung Le. arXiv preprint arXiv:2211.13723, 2022.
Multi-Task Learning (MTL) is a widely used and powerful learning paradigm for training deep neural networks that allows learning more than one objective with a single backbone. Compared to training tasks separately, MTL significantly reduces computational costs, improves data efficiency, and potentially enhances model performance by leveraging knowledge across tasks. Hence, it has been adopted in a variety of applications, ranging from computer vision to natural language processing and speech recognition. Among them, there is an emerging line of work in MTL that focuses on manipulating the task gradients to derive an ultimate gradient descent direction that benefits all tasks. Despite achieving impressive results on many benchmarks, directly applying these approaches without appropriate regularization techniques might lead to suboptimal solutions on real-world problems. In particular, standard training that minimizes the empirical loss on the training data can easily suffer from overfitting to low-resource tasks or be spoiled by noisily labeled ones, which can cause negative transfer between tasks and an overall performance drop. To alleviate such problems, we propose to leverage a recently introduced training method, named Sharpness-aware Minimization, which can enhance model generalization ability on single-task learning. Accordingly, we present a novel MTL training methodology, encouraging the model to find task-based flat minima for coherently improving its generalization capability on all tasks. Finally, we conduct comprehensive experiments on a variety of applications to demonstrate the merit of our proposed approach over existing gradient-based MTL methods, as suggested by our developed theory.
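For reference, here is a minimal PyTorch sketch of a vanilla single-task Sharpness-Aware Minimization step, the ingredient the paper extends to task-based flat minima in MTL (the toy model, data, and rho are assumptions):

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(16, 4)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
rho = 0.05
x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))

# 1) First forward/backward pass: gradient at the current weights.
loss = F.cross_entropy(model(x), y)
opt.zero_grad()
loss.backward()

# 2) Climb to the (approximate) worst-case point within an L2 ball of radius rho.
with torch.no_grad():
    grad_norm = torch.sqrt(sum((p.grad ** 2).sum() for p in model.parameters()))
    eps = [rho * p.grad / (grad_norm + 1e-12) for p in model.parameters()]
    for p, e in zip(model.parameters(), eps):
        p.add_(e)

# 3) Second forward/backward pass at the perturbed weights, then restore and step.
loss_perturbed = F.cross_entropy(model(x), y)
opt.zero_grad()
loss_perturbed.backward()
with torch.no_grad():
    for p, e in zip(model.parameters(), eps):
        p.sub_(e)
opt.step()
```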
- Continual Learning with Optimal Transport based Mixture Model. Quyen Tran, Hoang Phan, Khoat Than, Dinh Phung, and Trung Le. arXiv preprint arXiv:2211.16780, 2022.
Online Class Incremental Learning (CIL) is a challenging setting in Continual Learning (CL), wherein data of new tasks arrive in incoming streams and online learning models need to handle incoming data streams without revisiting previous ones. Existing works use a single centroid, adapted with incoming data streams, to characterize a class. This approach possibly exposes limitations when the incoming data stream of a class is naturally multimodal. To address this issue, in this work, we first propose an online mixture model learning approach based on nice properties of mature optimal transport theory (OT-MM). Specifically, the centroids and covariance matrices of the mixture model are adapted incrementally according to incoming data streams. The advantages are two-fold: (i) we can characterize complex data streams more accurately and (ii) by using the centroids for each class produced by OT-MM, we can estimate the similarity of an unseen example to each class more reasonably when doing inference. Moreover, to combat catastrophic forgetting in the CIL scenario, we further propose Dynamic Preservation. Particularly, after performing the dynamic preservation technique across data streams, the latent representations of the classes in the old and new tasks become more condensed themselves and more separated from each other. Together with a contraction feature extractor, this technique facilitates the model in mitigating catastrophic forgetting. The experimental results on real-world datasets show that our proposed method can significantly outperform the current state-of-the-art baselines.
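A hedged numpy sketch of one online, OT-style centroid update (greatly simplified: covariances and the dynamic preservation step are omitted, and the Sinkhorn regularisation and learning rate below are assumptions):

```python
# Softly assign a batch to centroids with entropic OT (Sinkhorn), then move
# each centroid toward the barycenter of the mass it receives.
import numpy as np

def sinkhorn_plan(cost, eps=1.0, n_iter=50):
    n, k = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(k, 1.0 / k)
    K = np.exp(-cost / eps) + 1e-12          # small constant for numerical stability
    u, v = np.ones(n), np.ones(k)
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]       # transport plan (rows ~ a, columns ~ b)

def online_update(centroids, batch, lr=0.1):
    cost = ((batch[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    P = sinkhorn_plan(cost)                  # soft assignment: batch -> centroids
    mass = P.sum(0)[:, None]                 # mass received by each centroid
    targets = (P.T @ batch) / (mass + 1e-12) # weighted barycenter of assigned points
    return centroids + lr * (targets - centroids)

centroids = np.random.randn(5, 2)
for _ in range(100):                         # stream of incoming batches
    centroids = online_update(centroids, np.random.randn(64, 2))
```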
- Stochastic Multiple Target Sampling Gradient Descent. Hoang Phan, Ngoc Tran, Trung Le, Toan Tran, Nhat Ho, and Dinh Phung. Advances in Neural Information Processing Systems, 2022.
Sampling from an unnormalized target distribution is an essential problem with many applications in probabilistic inference. Stein Variational Gradient Descent (SVGD) has been shown to be a powerful method that iteratively updates a set of particles to approximate the distribution of interest. Furthermore, when analysing its asymptotic properties, SVGD reduces exactly to a single-objective optimization problem and can be viewed as a probabilistic version of it. A natural question then arises: "Can we derive a probabilistic version of multi-objective optimization?". To answer this question, we propose Stochastic Multiple Target Sampling Gradient Descent (MT-SGD), enabling us to sample from multiple unnormalized target distributions. Specifically, our MT-SGD conducts a flow of intermediate distributions gradually orienting toward the multiple target distributions, which allows the sampled particles to move to the joint high-likelihood region of the target distributions. Interestingly, the asymptotic analysis shows that our approach reduces exactly to the multiple-gradient descent algorithm for multi-objective optimization, as expected. Finally, we conduct comprehensive experiments to demonstrate the merit of our approach for multi-task learning.
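As background, a compact numpy sketch of the standard single-target SVGD update that MT-SGD generalises (not the multi-target algorithm itself; the Gaussian toy target and fixed kernel bandwidth are assumptions):

```python
# One SVGD step with an RBF kernel: a driving term (kernel-weighted scores)
# plus a repulsive term (kernel gradients) that keeps the particles spread out.
import numpy as np

def svgd_step(X, score, h=1.0, stepsize=0.1):
    diff = X[None, :, :] - X[:, None, :]                  # diff[j, i] = x_i - x_j
    K = np.exp(-(diff ** 2).sum(-1) / h)                  # K[j, i] = k(x_j, x_i)
    grad_K = (2.0 / h) * (K[:, :, None] * diff).sum(0)    # sum_j grad_{x_j} k(x_j, x_i)
    phi = (K @ score(X) + grad_K) / X.shape[0]
    return X + stepsize * phi

# Example target: a standard Gaussian, whose score function is -x.
particles = 3.0 * np.random.randn(100, 2)
for _ in range(200):
    particles = svgd_step(particles, score=lambda X: -X)
```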
- Reducing Catastrophic Forgetting in Neural Networks via Gaussian Mixture Approximation. Hoang Phan*, Phan Tuan Anh*, Son Nguyen*, Ngo Van Linh, and Khoat Than. Advances in Knowledge Discovery and Data Mining (PAKDD), Lecture Notes in Computer Science, Springer, 2022.
Our paper studies continual learning (CL) problems in which data comes in sequence and the trained models are expected to be capable of utilizing existing knowledge to solve new tasks without losing performance on previous ones. This also poses a central difficulty in the field of CL, termed Catastrophic Forgetting (CF). In an attempt to address this problem, Bayesian methods provide a powerful principle, focusing on the inference scheme to estimate the importance of weights. Variational inference (VI), one of the most widely used methods within this vein, approximates the intractable posterior by a factorized distribution, thus offering computational efficiency. Notwithstanding many state-of-the-art results in practice, this simple assumption about the posterior distribution typically limits the model capacity to some extent. In this paper, we introduce a novel approach to mitigate forgetting in the Bayesian approach by enriching the posterior distribution with mixture models, which intuitively promotes neural networks to acquire knowledge from multiple tasks at a time. Moreover, in order to reduce the model's complexity growth as the number of components increases, we propose a solution that conducts low-rank decomposition on the variance of each component based on neural matrix factorization. Extensive experiments show that our method yields significant improvements compared to prior works on different benchmarks.
2021
- Matching the statements: A simple and accurate model for key point analysis. Hoang Phan, Long Nguyen, and Khanh Doan. Proceedings of the 8th Workshop on Argument Mining, 2021.
Key Point Analysis (KPA) is one of the most essential tasks in building an Opinion Summarization system, which is capable of generating key points for a collection of arguments toward a particular topic. Furthermore, KPA allows quantifying the coverage of each summary by counting its matched arguments. With the aim of creating high-quality summaries, it is necessary to have an in-depth understanding of each individual argument as well as its universal semantics in a specified context. In this paper, we introduce a promising model, named Matching the Statements (MTS), that incorporates the discussed topic information into argument/key point comprehension to fully understand their meanings, thus accurately ranking and retrieving the best-matching key points for an input argument. Our approach achieved 4th place in Track 1 of the Quantitative Summarization – Key Point Analysis Shared Task by IBM, with competitive strict and relaxed mean Average Precision scores of 0.8956 (3rd) and 0.9632 (7th), respectively.
- Contrastive learning for natural language-based vehicle retrieval. Tam Nguyen, Quang Pham, Linh Doan, Hoang Trinh, Anh Nguyen, and Hoang Phan. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2021.
AI City Challenge 2021 Task 5, Natural Language-Based Vehicle Tracking, is a natural language-based vehicle retrieval task that requires retrieving a single-camera track using a set of three natural language descriptions of the specific target. In this paper, we present our methods to tackle the difficulties of the provided task. Experiments with our approaches on the competitive dataset from the AI City Challenge 2021 show that our techniques achieve a Mean Reciprocal Rank of 0.1701 on the public test set and 0.1571 on the private test set.