Digital Forensics and Fraud Analysis Just Got Easier

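At LastingAsset, we are researching privacy-aware methods of fraud detection. A key part of this is parsing documents to extract the concepts they contain, and then matching those concepts across a range of documents. To support this, we are implementing cryptographic methods that allow artefacts to be searched for within encrypted, secure storage.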
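The article does not say which cryptographic scheme is used, so the following is only a minimal sketch of one common idea, a keyed "blind index": each artefact is stored as an HMAC token, and a search recomputes the token for the query and looks it up. The function names, the key and the document IDs are all hypothetical, and this is not presented as LastingAsset's implementation.

```python
import hashlib
import hmac


def blind_token(key: bytes, value: str) -> str:
    """Return a keyed, deterministic token for an artefact.

    Whitespace is removed and the text lower-cased first, so that
    'EH14 1DJ' and 'eh141dj' map to the same token.
    """
    normalised = "".join(value.split()).lower()
    return hmac.new(key, normalised.encode("utf-8"), hashlib.sha256).hexdigest()


def build_index(key: bytes, doc_artefacts: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each token to the set of document IDs that contain the artefact."""
    index: dict[str, set[str]] = {}
    for doc_id, artefacts in doc_artefacts.items():
        for artefact in artefacts:
            index.setdefault(blind_token(key, artefact), set()).add(doc_id)
    return index


def search(key: bytes, index: dict[str, set[str]], query: str) -> set[str]:
    """Look up a query without the index ever holding the plaintext artefact."""
    return index.get(blind_token(key, query), set())


# Hypothetical key and documents, for illustration only.
key = b"demo-key-not-for-production"
index = build_index(
    key,
    {
        "doc-1": ["f.smith@home.net", "EH14 1DJ"],
        "doc-2": ["EH141DJ", "10.0.0.1"],
    },
)
print(sorted(search(key, index, "eh14 1dj")))
```

The index holds only tokens, so whoever stores it cannot read the artefacts without the key. Deterministic tokens do still reveal which documents share an artefact, which is a trade-off a production design would need to weigh.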

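One way to parse documents is to use a GenAI engine such as Llama. For this we can use LlamaCloud, which can parse documents in almost every format we need, including PDF, DOCX, PPTX and XLSX. The service uses LlamaIndex. With it, we can create a model that can be run whenever it is needed. The following code extracts UK postcodes, IP addresses, email addresses, bank details, telephone numbers, passwords, credit card details, MAC addresses and UK cities from a sample PDF document: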

import os

from llama_cloud.types import ExtractConfig, ExtractMode
from llama_cloud_services import LlamaExtract
from pydantic import BaseModel, Field

# Fail early with a KeyError if the API key is not set in the environment.
LLAMA_CLOUD_API_KEY = os.environ['LLAMA_CLOUD_API_KEY']

# Each field description acts as the extraction instruction for that field.
class ExtractArtefacts(BaseModel):
    postcode: str = Field(description="Extract all of the UK postcodes from the document")
    ip_addresses: str = Field(description="Find all the IP addresses")
    email_address: str = Field(description="Find all the email addresses")
    bank_details: str = Field(description="Find all the bank details and sort codes")
    telephone: str = Field(description="Find all the telephone numbers and their location")
    passwords: str = Field(description="Find all the passwords")
    credit_card: str = Field(description="Find all the credit card details")
    mac_address: str = Field(description="Find all the MAC addresses")
    cities: str = Field(description="Find all the UK cities or towns")

llama_extract = LlamaExtract()

# Ask for per-field reasoning and source citations alongside the extracted data.
config = ExtractConfig(use_reasoning=True, cite_sources=True,
    extraction_mode=ExtractMode.MULTIMODAL)

# First run: create the agent.
agent = llama_extract.create_agent(name="artefact-parser", data_schema=ExtractArtefacts, config=config)
# Later runs: fetch the existing agent instead.
# agent = llama_extract.get_agent(name="artefact-parser")

artefact_info = agent.extract("mydoc.pdf")
print(artefact_info.data)
print(artefact_info.extraction_metadata)

To use this, we need an API key. Once the model has been created, it is added to LlamaCloud, and after that we can call it directly by name:

# The agent already exists, so create_agent is no longer needed:
# agent = llama_extract.create_agent(name="artefact-parser",
#     data_schema=ExtractArtefacts, config=config)
agent = llama_extract.get_agent(name="artefact-parser")
artefact_info = agent.extract("mydoc.pdf")
print(artefact_info.data)
print(artefact_info.extraction_metadata)

Next, we can put some test content into the PDF document, which contains the following text:

There is not much we can do apart from contacting,  there is not much we can
do apart from contacting f.smith@home.net to see if he would like to reboot
the server at 192.168.0.1. If he can do this then I will call him on
444.3212.5431. My credit card details are 4321-4444-5412-2310 and
5430-5411-4333-5123 and my name on the card is Fred Smith. I really like
the name domain fred@home.
Overall our target areas are SW1 7AF and EH105DT. I tested the server last
night, and I think the IP address is 10.0.0.1 and 192.168.1.1 and there are
two MAC addresses which is 01:23:45:67:89:ab or it might be 00.11.22.33.44.55.
The book we will use is "At Home" and it can be bought on amazon.com or google.com, if you search for 978-1-4302-1998-9. My account email addresses are  Fred.blogs@gmail.com and f.blogs@mail.com.
I think my password might be "Qwerty123" or "inkwell!!".
Here are the details that I have:
IBAN Sort code Account
---------------------------------------
GB91BKEN10000041610008 100000 41610008
GB27BOFI90212729823529  902127 29823529
GB17BOFS80055100813796 800551 00813796
GB92BARC20005275849855 200052 75849855
Shall we perhaps meet in Glasgow or Edinburgh, or even Stirling?
If you need to access the account, the password is: a1b2c3
Best regards,
Bert.
EH14 1DJ
+44 (960) 000 00 00
1/1/2009

In the MULTIMODAL and BALANCED modes, the cost is around 14 credits, where 1,000 credits cost 1 US dollar. Overall, the extracted information is:

{
'postcode': 'SW1 7AF, EH105DT, EH14 1DJ',
'ip_addresses': '192.168.0.1, 10.0.0.1, 192.168.1.1',
'email_address': 'f.smith@home.net, Fred.blogs@gmail.com, f.blogs@mail.com',
'bank_details': 'GB91BKEN10000041610008, 100000, 41610008; GB27BOFI90212729823529, 902127, 29823529; GB17BOFS80055100813796, 800551, 00813796; GB92BARC20005275849855, 200052, 75849855',
'telephone': '444.3212.5431, +44 (960) 000 00 00',
'passwords': 'Qwerty123, inkwell!!, a1b2c3',
'credit_card': '4321-4444-5412-2310, 5430-5411-4333-5123',
'mac_address': '01:23:45:67:89:ab, 00.11.22.33.44.55',
'cities': 'Glasgow, Edinburgh, Stirling'}
{'field_metadata': {
'postcode': {'reasoning': 'VERBATIM EXTRACTION', 'citation': [{'page': 1, 'matching_text': 'SW1 7AF and EH105DT'}, {'page': 1, 'matching_text': 'EH14 1DJ'}]},
'ip_addresses': {'reasoning': 'VERBATIM EXTRACTION', 'citation': [{'page': 1, 'matching_text': '192.168.0.1'}, {'page': 1, 'matching_text': '10.0.0.1 and 192.168.1.1'}]},
'email_address': {'reasoning': 'VERBATIM EXTRACTION', 'citation': [{'page': 1, 'matching_text': 'f.smith@home.net'}, {'page': 1, 'matching_text': 'Fred.blogs@gmail.com and f.blogs@mail.com'}]},
'bank_details': {'reasoning': 'VERBATIM EXTRACTION', 'citation': [{'page': 1, 'matching_text': '| GB91BKEN10000041610008 | 100000    | 41610008 |'}, {'page': 1, 'matching_text': '| GB27BOFI90212729823529 | 902127    | 29823529 |'}, {'page': 1, 'matching_text': '| GB17BOFS80055100813796 | 800551    | 00813796 |'}, {'page': 1, 'matching_text': '| GB92BARC20005275849855 | 200052    | 75849855 |'}]},
'telephone': {'reasoning': 'VERBATIM EXTRACTION', 'citation': [{'page': 1, 'matching_text': '444.3212.5431'}, {'page': 1, 'matching_text': '+44 (960) 000 00 00'}]},
'passwords': {'reasoning': 'VERBATIM EXTRACTION', 'citation': [{'page': 1, 'matching_text': 'password might be "Qwerty123" or "inkwell!!"'}, {'page': 1, 'matching_text': 'the password is: a1b2c3'}]},
'credit_card': {'reasoning': 'VERBATIM EXTRACTION', 'citation': [{'page': 1, 'matching_text': '4321-4444-5412-2310 and 5430-5411-4333-5123'}]},
'mac_address': {'reasoning': 'VERBATIM EXTRACTION', 'citation': [{'page': 1, 'matching_text': '01:23:45:67:89:ab or it might be 00.11.22.33.44.55'}]},
'cities': {'reasoning': 'VERBATIM EXTRACTION', 'citation': [{'page': 1, 'matching_text': 'meet in Glasgow or Edinburgh, or even Stirling'}]}}, 'usage': {'num_pages_extracted': 1, 'num_document_tokens': 461, 'num_output_tokens': 1082}}
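The cost figures quoted before this output (about 14 credits per extraction, 1,000 credits per US dollar) can be turned into a rough budget estimate. This is a small sketch that assumes every document costs the same as the one-page sample, which the article does not state; the function name is hypothetical.

```python
CREDITS_PER_DOLLAR = 1_000  # from the article: 1,000 credits cost 1 USD
CREDITS_PER_RUN = 14        # from the article: approximate cost of one extraction


def estimated_cost_usd(num_documents: int, credits_per_run: int = CREDITS_PER_RUN) -> float:
    """Estimate the cost in USD, assuming a flat credit cost per document."""
    return num_documents * credits_per_run / CREDITS_PER_DOLLAR


print(estimated_cost_usd(1))       # one document like the sample
print(estimated_cost_usd(10_000))  # a larger, hypothetical batch
```

Under that assumption, a single extraction costs about 1.4 cents and a batch of 10,000 similar documents about 140 dollars.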

Alternatively, we can get a version without citations:

config = ExtractConfig(use_reasoning=True, cite_sources=False,
    extraction_mode=ExtractMode.MULTIMODAL)

This gives less detail about why each item was parsed:

{'postcode': 'SW1 7AF, EH105DT, EH14 1DJ',
'ip_addresses': '192.168.0.1, 10.0.0.1, 192.168.1.1',
'email_address': 'f.smith@home.net, Fred.blogs@gmail.com, f.blogs@mail.com',
'bank_details': 'IBANs: GB91BKEN10000041610008, GB27BOFI90212729823529, GB17BOFS80055100813796, GB92BARC20005275849855; Sort codes: 100000, 902127, 800551, 200052; Accounts: 41610008, 29823529, 00813796, 75849855',
'telephone': '444.3212.5431, +44 (960) 000 00 00',
'passwords': 'Qwerty123, inkwell!!, a1b2c3',
'credit_card': '4321-4444-5412-2310, 5430-5411-4333-5123 (Name: Fred Smith)',
'mac_address': '01:23:45:67:89:ab, 00.11.22.33.44.55',
'cities': 'Glasgow, Edinburgh, Stirling'}
{'field_metadata': {'postcode': {'reasoning': 'VERBATIM EXTRACTION'},
'ip_addresses': {'reasoning': 'VERBATIM EXTRACTION'},
'email_address': {'reasoning': 'VERBATIM EXTRACTION'},
'bank_details': {'reasoning': 'VERBATIM EXTRACTION'},
'telephone': {'reasoning': 'VERBATIM EXTRACTION'},
'passwords': {'reasoning': 'VERBATIM EXTRACTION'},
'credit_card': {'reasoning': 'VERBATIM EXTRACTION'},
'mac_address': {'reasoning': 'VERBATIM EXTRACTION'},
'cities': {'reasoning': 'VERBATIM EXTRACTION'}},
'usage': {'num_pages_extracted': 1, 'num_document_tokens': 461, 'num_output_tokens': 694}}
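Both runs return each field as a single comma-separated string, and the exact layout differs between runs (compare the two bank_details values), so some post-processing is likely to be useful before the artefacts are matched. The following is a minimal sketch that uses only the Python standard library; the helper names are hypothetical, and the Luhn checksum is a standard check for card numbers rather than anything the article itself applies.

```python
import ipaddress


def split_field(value: str) -> list[str]:
    """Split a comma-separated extraction field into trimmed, non-empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def is_ip_address(text: str) -> bool:
    """Return True if the text parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(text)
        return True
    except ValueError:
        return False


def luhn_valid(card: str) -> bool:
    """Apply the Luhn checksum to the digits in a card number string."""
    digits = [int(ch) for ch in card if ch.isdigit()]
    if not digits:
        return False
    total = 0
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:  # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


# Values copied from the extraction output above.
data = {
    "ip_addresses": "192.168.0.1, 10.0.0.1, 192.168.1.1",
    "credit_card": "4321-4444-5412-2310, 5430-5411-4333-5123",
}

ips = [ip for ip in split_field(data["ip_addresses"]) if is_ip_address(ip)]
print(ips)
for card in split_field(data["credit_card"]):
    print(card, luhn_valid(card))
```

Checks like these do not replace the extraction step. They only flag items that are syntactically implausible, such as made-up card numbers in a test document.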

Conclusion

The era of regular expressions is over, and we welcome a new way of working built on intelligent parsing.
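To illustrate the kind of brittleness that motivates this conclusion, here is a small sketch that runs a naive IPv4 regular expression over one sentence of the sample document. The pattern is a hypothetical one written for this illustration, not something taken from the article.

```python
import re

# A naive IPv4 pattern: four groups of 1-3 digits separated by dots.
IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

sentence = (
    "I think the IP address is 10.0.0.1 and 192.168.1.1 and there are "
    "two MAC addresses which is 01:23:45:67:89:ab or it might be 00.11.22.33.44.55."
)

matches = IPV4_PATTERN.findall(sentence)
print(matches)  # ['10.0.0.1', '192.168.1.1', '00.11.22.33']
```

The pattern finds both real IP addresses, but it also reports the first four groups of the dot-separated MAC address as an IP address. In the extraction output shown earlier, the same value appears under mac_address instead.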
