Following the success of OpenAI’s ChatGPT, Microsoft’s Bing Chat, and Google Bard, researchers have created a new AI model with a much darker twist. Unlike ChatGPT and Google Bard, DarkBERT was trained exclusively on data from the dark web rather than the open web: text produced by hackers, cybercriminals, and other scammers.
A group of South Korean academics developed DarkBERT using data gathered over the Tor network, the software commonly used to access the dark web. In a published paper, they describe their method in detail: crawling the dark web, filtering the raw data, and training DarkBERT on the result.
DarkBERT is based on the RoBERTa architecture, established by researchers at Facebook in 2019, which in turn builds on BERT (Bidirectional Encoder Representations from Transformers), released by Google in 2018. Meta AI’s research paper describes RoBERTa as a “robustly optimized method for pretraining natural language processing (NLP) systems.” In a replication study, Facebook researchers improved on BERT’s effectiveness, crediting the fact that Google had released it as open source.
Thanks to its improved training methodology, RoBERTa achieved state-of-the-art scores on the General Language Understanding Evaluation (GLUE) NLP benchmark.
Because RoBERTa was released with less pretraining than it could benefit from, the South Korean academics behind DarkBERT were able to develop it further. They fed it data from the dark web for almost 16 days, using two datasets: one raw and one preprocessed.
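Continued pretraining of a BERT-style model works by masking a fraction of the input tokens and training the network to recover them from context. The sketch below shows only that masking step in plain Python; it is illustrative, not the authors' actual pipeline, which would use RoBERTa's tokenizer and training code.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) for BERT-style pretraining:
    labels hold the original token at masked positions, None elsewhere."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)   # the model must recover this token
        else:
            masked.append(tok)
            labels.append(None)  # this position is not scored
    return masked, labels
```

During training, the model sees the masked sequence and is penalized only at positions where a label is present; repeated over a large corpus, this is how further dark-web pretraining would reshape RoBERTa's representations.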
The researchers crawled the dark web through Tor’s anonymizing network and collected raw data, which they then filtered to create a dark web database. DarkBERT is the outcome of feeding this database to the RoBERTa large language model. The resulting model can analyze new dark web content, often written in the community’s own dialects and heavily coded messages, and extract valuable information from it.
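The paper does not publish its filtering code, but the kind of cleanup it describes, dropping near-empty pages and duplicates from the raw crawl before training, can be sketched roughly as follows (function name and threshold are hypothetical, not taken from the paper):

```python
import hashlib

def filter_pages(pages, min_words=20):
    """Drop exact duplicates and pages with too little text to be useful."""
    seen = set()
    kept = []
    for page in pages:
        text = page.strip()
        if len(text.split()) < min_words:
            continue  # near-empty page, likely an error placeholder
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of a page already kept
        seen.add(digest)
        kept.append(text)
    return kept
```

A real pipeline would also handle near-duplicates and remove personally identifying material, which the hashing shortcut above does not attempt.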
Fortunately, the researchers do not plan to make DarkBERT accessible to the general public, although scholarly requests are accepted, according to Dexerto. Even so, DarkBERT is expected to give investigators and law enforcement a deeper understanding of the dark web as a whole.
The team suggests that DarkBERT can be employed for various cybersecurity-related tasks, including detecting sites that sell ransomware or leak confidential data. It can also monitor the numerous dark web forums that receive daily updates and flag any exchange of illicit information.
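As a toy illustration of the monitoring task described above, a detector’s input/output shape might look like the sketch below. This is not DarkBERT’s actual method, which relies on learned language representations; simple keyword matching stands in here purely to show what such a forum monitor consumes and produces.

```python
# Hypothetical watch list; a real system would classify posts with a
# fine-tuned model rather than match keywords.
LEAK_TERMS = {"ransomware", "leak", "dump", "database", "breach"}

def flag_posts(posts):
    """Return (index, post) pairs whose text mentions a watched term."""
    flagged = []
    for i, post in enumerate(posts):
        words = {w.strip(".,!?").lower() for w in post.split()}
        if words & LEAK_TERMS:
            flagged.append((i, post))
    return flagged
```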
The research primarily involved crawling the dark web using Tor, the most popular browser for accessing such websites. Because these sites are not reachable from the “surface web,” the browser is necessary to open “onion links.” However, the research paper notes that a significant majority of these links now lead to error pages or pages with minimal information.
The team currently has no intention of releasing DarkBERT to the general public, and it places strong emphasis on keeping the dataset private, given the nature of the material found on the dark web. Academic requests, however, can be made.
DarkBERT has demonstrated its utility for researching the dark web, supporting cybersecurity tasks such as ransomware and leak site detection. It surpasses existing language models when evaluated on tasks and datasets from the dark web domain, and it shows promising potential for future research in both that domain and the cybersecurity industry.
Conclusion
DarkBERT lets researchers and security professionals delve deeper into the mysterious realm of the dark web, significantly advancing its exploration and analysis and helping them uncover its secrets. Its capabilities hold great promise for enhancing dark web research and strengthening cybersecurity efforts. However, DarkBERT also presents ethical considerations and limitations that demand careful attention. As it continues to evolve, it holds the potential to revolutionize our understanding of the dark web and bolster our defenses against emerging cyber threats.
DarkBERT uses a two-component framework consisting of a deep learning network architecture and a defense algorithm. The deep learning network provides the basic classification of text as malicious or benign, while the defense algorithm refines those classifications and mitigates adversarial attacks.
DarkBERT can detect a range of attacks, such as sentence embedding manipulation, word substitution, and syntactic perturbation.