Hassan Ali and 6 more authors

Deep Learning (DL) algorithms have achieved impressive results on many Natural Language Processing (NLP) tasks such as language-to-language translation, spam filtering, fake-news detection, and reading comprehension. However, research has shown that the adversarial vulnerabilities of deep learning networks also manifest themselves when DL is used for NLP tasks. Most mitigation techniques proposed to date are supervised, relying on adversarial retraining to improve robustness, which is impractical. This work introduces a novel, unsupervised methodology for detecting adversarial inputs to NLP classifiers. In summary, we note that minimally perturbing an input to change a model's output, a major strength of adversarial attacks, is also a weakness that leaves unique statistical marks reflected in the cumulative contribution scores of the input. In particular, we show that the cumulative contribution score, called the CF-score, of adversarial inputs is generally greater than that of clean inputs. We thus propose Con-Detect, a Contribution-based Detection method for detecting adversarial attacks against NLP classifiers. Con-Detect can be deployed with any classifier without having to retrain it. We experiment with multiple attackers (Text-bugger, Text-fooler, PWWS) on several architectures (MLP, CNN, LSTM, Hybrid CNN-RNN, BERT) trained for different classification tasks (IMDB sentiment classification, fake-news classification, AG news topic classification) under different threat models (Con-Detect-blind, Con-Detect-aware, and Con-Detect-adaptive attacks), and show that Con-Detect can reduce the attack success rate (ASR) of different attacks from 100% to as low as 0% in the best cases and around 70% in the worst case. Even in the worst case, we observe a 100% increase in the required number of queries and a 50% increase in the number of words perturbed, suggesting that Con-Detect is hard to evade.
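To make the contribution-score idea concrete, below is a minimal Python sketch of a contribution-based detector. It assumes access to the classifier only through a hypothetical `predict_proba` callable that maps a list of texts to class probabilities, uses leave-one-out word occlusion as the contribution measure, and applies a fixed threshold; the function names, the top-fraction parameter, and the threshold are illustrative choices, not the exact Con-Detect recipe.

```python
import numpy as np

def cf_score(text, predict_proba, top_fraction=0.3):
    """Fraction of total contribution concentrated in the top `top_fraction`
    of words, using leave-one-out occlusion as an illustrative measure."""
    words = text.split()
    base = predict_proba([text])[0]        # class probabilities for the full input
    label = int(np.argmax(base))           # predicted class
    drops = []
    for i in range(len(words)):
        occluded = " ".join(words[:i] + words[i + 1:])
        p = predict_proba([occluded])[0][label]
        drops.append(max(base[label] - p, 0.0))   # confidence drop when word i is removed
    drops = np.sort(np.asarray(drops))[::-1]      # most influential words first
    k = max(1, int(np.ceil(top_fraction * len(drops))))
    return float(drops[:k].sum() / (drops.sum() + 1e-12))

def con_detect_flag(text, predict_proba, threshold=0.8):
    """Flag an input as adversarial if its score exceeds a threshold
    calibrated on clean data (the value here is purely illustrative)."""
    return cf_score(text, predict_proba) > threshold
```

In this sketch, an input whose prediction hinges on a handful of highly influential words yields a score close to 1, mirroring the observation that adversarial inputs concentrate their contribution in the few perturbed words.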

Hassan Ali and 7 more authors

Recent works have highlighted how misinformation is plaguing our online social networks. Numerous algorithms for automated misinformation detection are centered around deep learning (DL), which requires large amounts of data for training. However, privacy and ethical concerns reduce data sharing by stakeholders, impeding data-driven misinformation detection. Current data encryption techniques that provide privacy guarantees cannot be naively extended to text inference with DL models, mainly due to the errors induced by stacked encrypted operations and by the polynomial approximations of the otherwise encryption-incompatible non-polynomial operations. In this paper, we show, formally and empirically, the effectiveness of (1) L2-regularized training in reducing the overall error induced by approximate polynomial activations, and (2) the sigmoid activation in regulating the error accumulated by cascaded operations over encrypted data. We assume a federated learning-encrypted inference (FL-EI) setup for text-based misinformation detection as a secure and privacy-aware cloud service, where classifiers are securely trained in the FL framework and inference is performed on homomorphically encrypted data. We evaluate three architectures, namely Logistic Regression (LR), Multilayer Perceptron (MLP), and Self-Attention Network (SAN), on two public text-misinformation datasets with some interesting results; for example, by simply replacing the ReLU activation with sigmoid, we were able to reduce the output error by 1750× in the best case and 43.75× in the worst case.
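The intuition behind the second design choice can be sketched numerically: a smooth, bounded activation such as sigmoid is approximated far better by the low-degree polynomials that homomorphic encryption requires than the kinked ReLU is, and L2 regularization helps keep pre-activations inside the interval where the approximation is accurate. The snippet below is a minimal illustration of that point; the degree, the interval [-5, 5], and the least-squares fit are assumptions made for the example, not the paper's exact approximation scheme.

```python
import numpy as np

# Fit low-degree polynomials to sigmoid and ReLU on a bounded interval and
# compare their worst-case approximation errors.
x = np.linspace(-5.0, 5.0, 2001)
sigmoid = 1.0 / (1.0 + np.exp(-x))
relu = np.maximum(x, 0.0)

degree = 3
sig_poly = np.polyfit(x, sigmoid, degree)    # least-squares polynomial fit
relu_poly = np.polyfit(x, relu, degree)

sig_err = np.max(np.abs(np.polyval(sig_poly, x) - sigmoid))
relu_err = np.max(np.abs(np.polyval(relu_poly, x) - relu))
print(f"max |sigmoid - poly{degree}| on [-5, 5]: {sig_err:.4f}")
print(f"max |relu    - poly{degree}| on [-5, 5]: {relu_err:.4f}")
```

On this interval the smooth sigmoid incurs a noticeably smaller worst-case error than ReLU, and it is this per-layer error that would otherwise compound across cascaded encrypted operations.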

Hassan Ali and 3 more authors

While Deep Neural Networks (DNNs) have been instrumental in achieving state-of-the-art results for various Natural Language Processing (NLP) tasks, recent works have shown that the decisions made by DNNs cannot always be trusted. Explainable Artificial Intelligence (XAI) methods have recently been proposed as a means of increasing a DNN's reliability and trustworthiness. However, these XAI methods are themselves open to attack and can be manipulated in both white-box (gradient-based) and black-box (perturbation-based) scenarios. Exploring novel techniques to attack and robustify these XAI methods is crucial to fully understanding these vulnerabilities. In this work, we propose Tamp-X, a novel attack that tampers with the activations of robust NLP classifiers, forcing state-of-the-art white-box and black-box XAI methods to generate misrepresented explanations. To the best of our knowledge, in the current NLP literature, we are the first to attack both white-box and black-box XAI methods simultaneously. We quantify the reliability of explanations using three different metrics: the descriptive accuracy, the cosine similarity, and the Lp norms of the explanation vectors. Through extensive experimentation, we show that the explanations generated for the tampered classifiers are not reliable and significantly disagree with those generated for the untampered classifiers, even though the output decisions of the tampered and untampered classifiers are almost always the same. Additionally, we study the adversarial robustness of the tampered NLP classifiers and find that tampered classifiers which are harder for the XAI methods to explain are also harder for adversarial attackers to attack.
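As a concrete illustration, the two vector-level agreement metrics mentioned above can be computed as follows. This is a minimal sketch with made-up attribution values; the helper names are ours for the example rather than anything defined in the paper.

```python
import numpy as np

def cosine_similarity(e1, e2):
    """Cosine similarity between two explanation (attribution) vectors."""
    e1, e2 = np.asarray(e1, dtype=float), np.asarray(e2, dtype=float)
    return float(e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2) + 1e-12))

def lp_distance(e1, e2, p=2):
    """L_p norm of the difference between two explanation vectors."""
    diff = np.asarray(e1, dtype=float) - np.asarray(e2, dtype=float)
    return float(np.linalg.norm(diff, ord=p))

# Attributions for the same input from an untampered and a tampered classifier
# (the values are made up purely for the example).
clean_expl    = [0.70, 0.10, -0.05,  0.20]
tampered_expl = [0.05, 0.60,  0.40, -0.30]
print(cosine_similarity(clean_expl, tampered_expl))  # near zero: low agreement
print(lp_distance(clean_expl, tampered_expl, p=2))   # large L2 shift
```

Low cosine similarity and large Lp distances between the two attribution vectors are the kind of disagreement the evaluation measures, even when both classifiers return the same output label.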

Hassan Ali and 3 more authors

We have witnessed a continuing arms race between backdoor attacks and the corresponding defense strategies for Deep Neural Networks (DNNs). However, most state-of-the-art defenses rely on the statistical sanitization of inputs or latent DNN representations to capture trojan behavior. In this paper, we first challenge the robustness of many recently reported defenses by introducing a novel variant of the targeted backdoor attack, called the low-confidence backdoor attack. The low-confidence attack inserts the backdoor by assigning uniformly distributed probabilistic labels to the poisoned training samples and is applicable to many practical scenarios, such as Federated Learning and model-reuse settings. We evaluate our attack against five state-of-the-art defense methods, viz., STRIP, Gradient-Shaping, Februus, ULP-defense and ABS-defense, under the same threat model as assumed by the respective defenses, and achieve Attack Success Rates (ASRs) of 99%, 63.73%, 91.2%, 80% and 100%, respectively. After carefully studying the properties of state-of-the-art attacks, including low-confidence attacks, we present HaS-Net, a mechanism to securely train DNNs against a number of backdoor attacks under the data-collection scenario. For this purpose, we use a reasonably small healing dataset, approximately 2% to 15% of the size of the training data, to heal the network at each iteration. We evaluate our defense on different datasets (Fashion-MNIST, CIFAR-10, Celebrity Face, Consumer Complaint and Urban Sound) and network architectures (MLPs, 2D-CNNs, 1D-CNNs), and against several attack configurations (standard backdoor attacks, invisible backdoor attacks, label-consistent attacks and all-trojan backdoor attacks), including their low-confidence variants. Our experiments show that HaS-Net can decrease ASRs from over 90% to less than 15%, independent of the dataset, attack configuration and network architecture.
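To illustrate the core idea of the low-confidence variant, the sketch below constructs a probabilistic (soft) target for a poisoned sample instead of the usual one-hot label on the attacker's target class. The margin given to the target class and the helper name are illustrative assumptions, not the exact distribution used in the attack.

```python
import numpy as np

def low_confidence_label(n_classes, target_class, target_margin=0.2):
    """Soft label for a poisoned sample: probability mass is spread uniformly
    over all classes, with a small extra margin on the attacker's target class.
    `target_margin` is an illustrative choice, not the paper's value."""
    label = np.full(n_classes, (1.0 - target_margin) / n_classes)
    label[target_class] += target_margin
    return label

# e.g. a 10-class problem with target class 3: the poisoned sample is trained
# toward a peak of ~0.28 on class 3 instead of a one-hot 1.0.
print(low_confidence_label(10, 3))
```

Training on such soft targets implants the trigger while keeping the model's outputs on poisoned inputs low-confidence, which is why defenses that sanitize inputs or representations based on statistically conspicuous, high-confidence trojan behavior can be challenged by this variant.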