- History of Encoding
- Level 1: One-hot vector = [0,0,0,1,0,0,0] (exactly one element is 1, i.e., “hot”)
- Level 2: TF-IDF = words that appear frequently across the whole corpus are downweighted as unimportant, while words that appear frequently within a single document are treated as important
- Level 3: Word2vec = dense vectors [x,x,x] where analogies like king − male + female ≈ queen hold (also captures some of the essence of tf-idf)
- Level 4: BERT
- BERT alone cannot determine what pronouns like he/she refer to
- But it can infer this from the surrounding words (as in word sense disambiguation)
https://ishitonton.hatenablog.com/entry/2018/11/25/200332
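The encoding levels above can be illustrated with a minimal sketch (the vocabulary, documents, and 2-d “word2vec” vectors here are made up for demonstration):

```python
import math

# Level 1: one-hot -- each word is a vector with a single 1 ("hot") at its index.
vocab = ["the", "king", "queen", "male", "female", "cake", "tree"]
def one_hot(word):
    return [1 if w == word else 0 for w in vocab]

# Level 2: TF-IDF -- globally frequent words score low, locally frequent words score high.
docs = [["the", "king", "the", "male"], ["the", "queen", "the", "female"], ["the", "cake"]]
def tf_idf(word, doc):
    tf = doc.count(word) / len(doc)               # frequent locally -> large
    idf = math.log(len(docs) / sum(1 for d in docs if word in d))  # frequent globally -> small
    return tf * idf

# Level 3: word2vec-style analogy (toy hand-made vectors, not trained embeddings).
vec = {"king": [0.9, 0.8], "queen": [0.9, 0.2], "male": [0.1, 0.8], "female": [0.1, 0.2]}
analogy = [round(k - m + f, 6) for k, m, f in zip(vec["king"], vec["male"], vec["female"])]

print(one_hot("king"))          # [0, 1, 0, 0, 0, 0, 0]
print(tf_idf("the", docs[0]))   # 0.0 -- appears in every doc, so IDF is 0
print(tf_idf("king", docs[0]))  # > 0 -- frequent locally, rare globally
print(analogy)                  # [0.9, 0.2] -- matches vec["queen"]
```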
- About Embedding
(Expert in Information Science)
-
Overlapping field of Linguistics and Information Science
-
Approach
- Want to understand the mechanism of human input/output of Language
- Directly studying how the brain processes information would require neuroscience
- Therefore, explore the mechanism through language, which is observable
-
Some things to do
- Machine Translation
- Text Comprehension
- Text Mining: Extracting information from massive natural language data like tweets
-
It’s not just about processing language
- It connects to the deep aspects of human intelligence, such as the meaning, knowledge, and emotions that language carries
- (blu3mo) A broader field than imagined
-
Methodology
-
Cannot process language as mere strings (e.g., “ケヤキ” (zelkova tree) and “ケーキ” (cake) are similar as strings but have completely different meanings)
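The ケヤキ/ケーキ point can be made concrete with the standard library: string similarity is high even though the meanings are unrelated.

```python
import difflib

# "ケヤキ" (zelkova tree) and "ケーキ" (cake) share 2 of 3 characters,
# so a pure string comparison rates them as quite similar.
ratio = difflib.SequenceMatcher(None, "ケヤキ", "ケーキ").ratio()
print(ratio)  # 2 matching chars -> 2*2 / (3+3) = 0.666...
```

This is why surface-level string processing is not enough and meaning must be modeled separately.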
-
How to handle meaning
- What is meaning?: something whose equivalence humans can judge
- (Since we cannot observe the processing in our minds, we use observable equivalence judgments)
-
How to combine “Discrete structures” and “Continuous regularities”
- The structure of natural language has clear correctness = there is a discrete value structure
- e.g., Changing one pixel in an image doesn’t have much impact, but changing one character in natural language is a big problem
- However, there is also ambiguity and uncertainty (statistical, continuous value properties)
- Directly related to the ambiguity of language
- In other words, it has a complex nature of both discrete and continuous aspects
-
What to learn from the corpus
- Natural language text data is called a “corpus”
- Can learn regularities and other things from the corpus
- e.g., Language Model (evaluating the naturalness of sentences)
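The language-model example can be sketched as a toy bigram model (the tiny corpus below is made up): it learns word-pair regularities and scores a sentence by how well its word transitions match the corpus.

```python
from collections import Counter

# Toy corpus (made up). A bigram model estimates P(next word | current word)
# and scores a sentence by the product of those conditional probabilities.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def naturalness(sentence):
    words = sentence.split()
    p = 1.0
    for a, b in zip(words, words[1:]):
        p *= bigrams[(a, b)] / unigrams[a]
    return p

# A sentence following the corpus regularities scores higher than a shuffled one.
print(naturalness("the cat sat on the mat"))   # 0.0625
print(naturalness("mat the on sat cat the"))   # 0.0 -- unseen bigrams
```

Real language models (e.g., BERT above) replace these counts with learned neural representations, but the scoring idea is the same.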
-
The most commonly used technology is Machine Learning
-
- Technology to understand the syntax of sentences
- Described in detail above
-
- Technology to understand the meaning of sentences/words
- Described in detail above