Trying to extract Japanese addresses from text with Google's Cloud Natural Language

This is a follow-up to my previous post,

Reading text from images with the Google Cloud Vision API

and this time I'd like to try the NLP (natural language processing) API that Google provides.

The goal this time is simple: if a piece of text contains an address, pull that address out.

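Like last time, this is very much at the "I gave it a try" level, and since Google provides sample code, there is nothing original about the code at all (lol). Still, I want to write down the steps and record whether Cloud Natural Language, as of July 21, 2021, can extract a Japanese address from a sentence.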

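Now, Google's Cloud Natural Language comes in three flavors. One is for medical documents, so I'll leave it out this time; the other two, AutoML and the Natural Language API, are described on the page below.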

https://cloud.google.com/natural-language

Hmm. As I understand it, AutoML requires you to train the model yourself. So this time I'll use the Natural Language API.

① First, set up the Natural Language API

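The steps are the same as for the Vision API last time. It's a bit tedious, but let's get it done.

By the way, the Vision API is not a prerequisite for the Natural Language API; I just happen to be trying them in the order Vision API → Natural Language API.

The steps are described on the following page.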

https://cloud.google.com/natural-language/docs/setup

② Install the client library

I'm using Python again this time, so I install it with

pip install --upgrade google-cloud-language

③ Finally, run the code

This time I'll use the sample below.

The goal is to extract an address from a sentence, so I want to try what is called entity analysis.

https://cloud.google.com/natural-language/docs/samples/language-entities-text

If you run it as is, you get the following error:

google.auth.exceptions.DefaultCredentialsError: Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application. For more information, please see https://cloud.google.com/docs/authentication/getting-started

This is the same error as last time! So I specify the key file inside the code.
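Before creating the client, a small pre-flight check can make this kind of failure easier to diagnose. The following is only a sketch using the standard library; `require_key_file` is a hypothetical helper name, not part of the Google client library.

```python
import os


def require_key_file(environ=os.environ):
    """Return the key file path from GOOGLE_APPLICATION_CREDENTIALS, or raise.

    Hypothetical helper: it only checks that the variable is set and that
    the file exists. It does not validate the key itself.
    """
    path = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Key file not found: {path}")
    return path
```

Calling `require_key_file()` right after setting `os.environ[...]` would surface a mistyped path immediately, rather than at the first API call.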

The full code is as follows.

from google.cloud import language_v1
import os

# The path below is only an example; replace it with the location of your own key file.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]="D:/natural_language_api_key/key.json"

def sample_analyze_entities(text_content):
    """
    Analyzing Entities in a String

    Args:
      text_content The text content to analyze
    """

    client = language_v1.LanguageServiceClient()
    type_ = language_v1.Document.Type.PLAIN_TEXT

    # The sample uses "en"; change it to "ja" for Japanese.
    language = "ja"
    
    document = {"content": text_content, "type_": type_, "language": language}

    # Available values: NONE, UTF8, UTF16, UTF32
    encoding_type = language_v1.EncodingType.UTF8

    response = client.analyze_entities(request = {'document': document, 'encoding_type': encoding_type})

    # Loop through entities returned from the API
    for entity in response.entities:
        print(u"Representative name for the entity: {}".format(entity.name))

        print(u"Entity type: {}".format(language_v1.Entity.Type(entity.type_).name))

        print(u"Salience score: {}".format(entity.salience))

        # Loop over the metadata associated with entity. For many known entities,
        # the metadata is a Wikipedia URL (wikipedia_url) and Knowledge Graph MID (mid).
        # Some entity types may have additional metadata, e.g. ADDRESS entities
        # may have metadata for the address street_name, postal_code, et al.
        for metadata_name, metadata_value in entity.metadata.items():
            print(u"{}: {}".format(metadata_name, metadata_value))

        # Loop over the mentions of this entity in the input document.
        # The API currently supports proper noun mentions.
        for mention in entity.mentions:
            print(u"Mention text: {}".format(mention.text.content))

            # Get the mention type, e.g. PROPER for proper noun
            print(
                u"Mention type: {}".format(language_v1.EntityMention.Type(mention.type_).name)
            )

    # Get the language of the text, which will be the same as
    # the language specified in the request or, if not specified,
    # the automatically-detected language.
    print(u"Language of the text: {}".format(response.language))

sample_analyze_entities('昔々、あるところに一匹のこざるがおりました。')

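I ran the sentence

‘昔々、あるところに一匹のこざるがおりました。’

("Once upon a time, in a certain place, there lived a little monkey.") through entity analysis.

Here is the result.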

Representative name for the entity: ところ
Entity type: OTHER
Salience score: 1.0
Mention text: ところ
Mention type: COMMON
Representative name for the entity: 1
Entity type: NUMBER
Salience score: 0.0
value: 1
Mention text: 1
Mention type: TYPE_UNKNOWN
Language of the text: ja

It properly recognized "ところ" (place) and the kanji numeral "一" (one)!

For the meaning of each entity field, see the following page.

https://cloud.google.com/natural-language/docs/reference/rest/v1beta2/Entity

Now let's put in a more concrete address. I change the sentence to analyze as follows.

‘昔々、神奈川県横浜市西区北幸２丁目10−39 日総第5ビル 9Fに一匹のこざるがおりました。’

Run it with this and, boom!

Representative name for the entity: 神奈川県

Entity type: LOCATION

Salience score: 0.28228825330734253

wikipedia_url: https://en.wikipedia.org/wiki/Kanagawa_Prefecture

mid: /m/0gqm3

Mention text: 神奈川県

Mention type: PROPER

Representative name for the entity: 横浜市

Entity type: LOCATION

Salience score: 0.2538958787918091

wikipedia_url: https://en.wikipedia.org/wiki/Yokohama

mid: /m/0kstw

Mention text: 横浜市

Mention type: PROPER

Representative name for the entity: 西区

Entity type: LOCATION

Salience score: 0.2538958787918091

mid: /m/0d28cy

wikipedia_url: https://en.wikipedia.org/wiki/Nishi-ku,_Yokohama

Mention text: 西区

Mention type: PROPER

Representative name for the entity: 北幸

Entity type: LOCATION

Salience score: 0.2099200040102005

mid: /g/121d1tc6

wikipedia_url: https://ja.wikipedia.org/wiki/北幸

Mention text: 北幸

Mention type: PROPER

Representative name for the entity: 神奈川県横浜市西区北幸２丁目10−39

Entity type: ADDRESS

Salience score: 0.0

locality: 横浜市

street_number: 1039

sublocality: 西区

broad_region: 神奈川県

country: JP

Mention text: 神奈川県横浜市西区北幸２丁目10−39

Mention type: TYPE_UNKNOWN

Representative name for the entity: ２

Entity type: NUMBER

Salience score: 0.0

value: 2

Mention text: ２

Mention type: TYPE_UNKNOWN

Representative name for the entity: 10

Entity type: NUMBER

Salience score: 0.0

value: 10

Mention text: 10

Mention type: TYPE_UNKNOWN

Representative name for the entity: 39

Entity type: NUMBER

Salience score: 0.0

value: 39

Mention text: 39

Mention type: TYPE_UNKNOWN

Representative name for the entity: 5

Entity type: NUMBER

Salience score: 0.0

value: 5

Mention text: 5

Mention type: TYPE_UNKNOWN

Representative name for the entity: 9

Entity type: NUMBER

Salience score: 0.0

value: 9

Mention text: 9

Mention type: TYPE_UNKNOWN

Representative name for the entity: 一

Entity type: NUMBER

Salience score: 0.0

value: 1

Mention text: 一

Mention type: TYPE_UNKNOWN

Language of the text: ja

The amount of information jumped all at once. ( ˊᵕˋ )

Looking at the part below,

Representative name for the entity: 神奈川県横浜市西区北幸２丁目10−39

Entity type: ADDRESS

Salience score: 0.0

locality: 横浜市

street_number: 1039

sublocality: 西区

broad_region: 神奈川県

country: JP

Mention text: 神奈川県横浜市西区北幸２丁目10−39

Mention type: TYPE_UNKNOWN

神奈川県横浜市西区北幸２丁目10−39 was recognized as an address (its Entity type is ADDRESS)!
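Since the goal is to pull out only the address, the results can be filtered down to ADDRESS entities. Here is a minimal sketch that runs without calling the API. It assumes each entity has first been copied into a plain dict with the hypothetical keys `name`, `type` and `metadata`. The helper names `pick_addresses` and `format_region` are also made up for illustration.

```python
def pick_addresses(entities):
    """Keep only the entities whose type name is ADDRESS."""
    return [e for e in entities if e["type"] == "ADDRESS"]


def format_region(entity):
    """Join prefecture, city and ward from the ADDRESS metadata, if present."""
    meta = entity["metadata"]
    return "".join(meta.get(key, "") for key in ("broad_region", "locality", "sublocality"))


# Values copied from the output above. With a real response, each dict could be
# built inside the loop over response.entities, for example:
#   {"name": entity.name,
#    "type": language_v1.Entity.Type(entity.type_).name,
#    "metadata": dict(entity.metadata)}
entities = [
    {"name": "神奈川県", "type": "LOCATION", "metadata": {"mid": "/m/0gqm3"}},
    {
        "name": "神奈川県横浜市西区北幸２丁目10−39",
        "type": "ADDRESS",
        "metadata": {
            "locality": "横浜市",
            "street_number": "1039",
            "sublocality": "西区",
            "broad_region": "神奈川県",
            "country": "JP",
        },
    },
    {"name": "39", "type": "NUMBER", "metadata": {"value": "39"}},
]

for address in pick_addresses(entities):
    print(address["name"], "/", format_region(address))
```

Filtering on the type name keeps the LOCATION and NUMBER entities out of the way, so only the one ADDRESS entry remains.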

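What's more, it also recognized the structure 神奈川県 (prefecture) → 横浜市 (city) → 西区 (ward). As for the block and lot number, though, the result is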

street_number: 1039

so it seems the API could not recognize a number written in the 10-39 form (sob).
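One possible workaround is to re-parse the number from the mention text instead of relying on `street_number`. The sketch below is only an illustration with the standard library: `parse_block_number`, the tuple it returns, and the list of hyphen-like characters are all assumptions, and kanji numerals such as 二丁目 are not handled.

```python
import re
import unicodedata

# Hyphen-like characters often seen in addresses (an assumed, non-exhaustive list).
HYPHEN_VARIANTS = "\u2010\u2011\u2012\u2013\u2014\u2015\u2212\uff0d"
_TO_ASCII_HYPHEN = str.maketrans(HYPHEN_VARIANTS, "-" * len(HYPHEN_VARIANTS))

# Optional "<n>丁目", then a number, then an optional "-<number>".
_BLOCK_RE = re.compile(r"(?:(\d+)丁目)?(\d+)(?:-(\d+))?")


def parse_block_number(mention_text):
    """Return (chome, banchi, go) as strings, or None if no number is found."""
    # NFKC turns full-width digits such as "２" into ASCII "2".
    normalized = unicodedata.normalize("NFKC", mention_text)
    normalized = normalized.translate(_TO_ASCII_HYPHEN)
    match = _BLOCK_RE.search(normalized)
    return match.groups() if match else None


# U+2212 (MINUS SIGN) is used here as an example separator.
print(parse_block_number("神奈川県横浜市西区北幸２丁目10\u221239"))  # ('2', '10', '39')
```

Normalizing first keeps the regular expression simple, because it only has to deal with ASCII digits and a plain "-".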

Also, the building name and floor, "日総第5ビル 9F", were not recognized as part of the address (sob).

I also tried adding a half-width space after "日総第5ビル 9F", changing the sentence to

‘昔々、神奈川県横浜市西区北幸２丁目10−39 日総第5ビル 9F に一匹のこざるがおりました。’

but the result was the same.
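To at least see what was left out, one option is to slice the original text right after the ADDRESS mention. This is a rough sketch, assuming `mention.text.begin_offset` is a UTF-8 byte offset (the code above requests `encoding_type` UTF8). `text_after_mention` is a hypothetical helper, and the offset in the demo is computed locally as a stand-in for the value the API would return.

```python
def text_after_mention(text, mention_content, begin_offset):
    """Return whatever follows an entity mention in the original text.

    begin_offset is assumed to be a UTF-8 byte offset, so the slicing is
    done on the encoded bytes rather than on the str itself.
    """
    raw = text.encode("utf-8")
    end = begin_offset + len(mention_content.encode("utf-8"))
    return raw[end:].decode("utf-8")


text = "昔々、神奈川県横浜市西区北幸２丁目10−39 日総第5ビル 9Fに一匹のこざるがおりました。"
mention = "神奈川県横浜市西区北幸２丁目10−39"

# Stand-in for mention.text.begin_offset from a real response.
offset = text.encode("utf-8").find(mention.encode("utf-8"))

print(text_after_mention(text, mention, offset))
```

Deciding where the building name ends (here, before "に") would still need some rule of your own; this only shows where to start looking.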
