Explanations of Word Embeddings (Embedding) from Several Websites

Posted on 2024-09-02, updated on 2025-05-12

https://www.ibm.com/topics/embedding

Embedding is a means of representing objects like text, images and audio as points in a continuous vector space where the locations of those points in space are semantically meaningful to machine learning (ML) algorithms.

https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/embeddings?tabs=console

An embedding is a special format of data representation that can be easily utilized by machine learning models and algorithms. The embedding is an information dense representation of the semantic meaning of a piece of text. Each embedding is a vector of floating point numbers, such that the distance between two embeddings in the vector space is correlated with semantic similarity between two inputs in the original format.
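The claim that distance between embeddings tracks semantic similarity can be illustrated with a minimal sketch. The three vectors below are made up for illustration (real embeddings come from a trained model and have hundreds or thousands of dimensions), and cosine similarity is only one common choice of measure, assumed here rather than taken from the quoted page.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional "embeddings", hand-written for this example.
cat = [0.9, 0.8, 0.1]
kitten = [0.85, 0.75, 0.2]
car = [0.1, 0.2, 0.95]

# Related concepts should score higher than unrelated ones.
print("cat vs kitten:", cosine_similarity(cat, kitten))
print("cat vs car:   ", cosine_similarity(cat, car))
```

With these toy vectors the cat/kitten pair scores much closer to 1 than the cat/car pair, which is the behaviour the definition describes.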

https://aws.amazon.com/cn/what-is/embeddings-in-machine-learning/

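Embeddings are numerical representations of real-world objects that machine learning (ML) and artificial intelligence (AI) systems use to understand complex knowledge domains the way humans do. For example, computing algorithms understand that the difference between 2 and 3 is 1, which indicates that 2 and 3 are more closely related than 2 and 100. Real-world data, however, contains more complex relationships. For example, bird-nest and lion-den are analogous pairs, while day-night are opposites. Embeddings convert real-world objects into complex mathematical representations that capture the inherent properties of, and relationships between, real-world data. The whole process is automated: AI systems create embeddings themselves during training and use them as needed to complete new tasks.
Machine learning models cannot interpret information intelligibly in its raw format and need numerical data as input. They use neural network embeddings to convert real-world information into numerical representations called vectors. Vectors are numerical values that represent information in a multi-dimensional space. They help machine learning models find similarities among sparsely distributed items.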

https://openai.com/index/introducing-text-and-code-embeddings/

Embeddings are numerical representations of concepts converted to number sequences, which make it easy for computers to understand the relationships between those concepts.
Embeddings that are numerically similar are also semantically similar.

https://huggingface.co/blog/getting-started-with-embeddings

An embedding is a numerical representation of a piece of information, for example, text, documents, images, audio, etc. The representation captures the semantic meaning of what is being embedded, making it robust for many industry applications.
… Since this list captures the meaning, we can do exciting things, like calculating the distance between different embeddings to determine how well the meaning of two sentences matches.

https://datascience.stackexchange.com/a/101720

Although the word thus originally meant the mapping from one space to another, it has metonymically shifted to mean the resulting dense vector in the latent space, and it is in this sense that we currently use the word.

《大语言模型》 (Large Language Models), pp. 2-3, https://llmbook-zh.github.io/

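Turing Award winner Yoshua Bengio introduced the concept of distributed word representation in an early work [6] and built a target-word prediction function based on aggregated context features (that is, distributed word vectors). Distributed word representations use low-dimensional dense vectors to represent the semantics of words. This is fundamentally different from the sparse word vector representation based on the vocabulary space (one-hot representation) and can capture richer latent semantic features. The non-zero representations of dense vectors are also well suited to building complex language models and effectively overcome the data sparsity problem of statistical language models. Distributed word vectors are also called "word embeddings".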
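The contrast between one-hot and dense representations can be sketched in a few lines. The vocabulary and the dense vectors below are hypothetical values chosen by hand, not output from any model; the point is only that distinct one-hot vectors never overlap, while dense vectors can place related words near each other.

```python
# Toy vocabulary, assumed for illustration.
vocab = ["cat", "kitten", "car"]


def one_hot(word):
    """Sparse vector with one slot per vocabulary entry and a single 1."""
    return [1.0 if w == word else 0.0 for w in vocab]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical low-dimensional dense vectors; a real model would learn these.
dense = {
    "cat": [0.9, 0.1],
    "kitten": [0.8, 0.2],
    "car": [0.1, 0.9],
}

# Any two different one-hot vectors are orthogonal, so they carry no
# information about how related the words are.
print("one-hot cat.kitten:", dot(one_hot("cat"), one_hot("kitten")))
print("one-hot cat.car:   ", dot(one_hot("cat"), one_hot("car")))

# Dense vectors can express graded similarity.
print("dense cat.kitten:", dot(dense["cat"], dense["kitten"]))
print("dense cat.car:   ", dot(dense["cat"], dense["car"]))
```

The one-hot vectors also grow with the vocabulary size, whereas the dense vectors keep a fixed, small dimension.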

Pitfalls of Packaging miniforge Locally on Win10

Posted on 2022-01-22, updated on 2022-01-23

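(Note: in this post, "packaging" means the workflow of setting up a conda environment -> installing constructor into that environment -> generating an installer with constructor. The term may not be strictly correct; please bear with me.)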

Preface

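On a whim (why can't I keep my hands to myself.jpg) I wanted to try building conda from source locally. After reading the official documentation and mamba's, I found that building a conda environment from scratch without a conda environment already installed is basically impossible. After several days of searching I changed the plan: use micromamba, the program the miniforge project uses for bootstrapping, to create a conda environment first, install constructor on top of it, and then package a miniforge installer.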

Process

Omitted... (CI is so nice, why did I stubbornly refuse to use it?)

A technical summary with no technical content (lol)

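miniforge's scripts are adapted to git-bash; a self-installed msys2 or cygwin may cause trouble. My guess is that some problems arise because the home directory in msys2 and cygwin is not c:\users\username, so micromamba and miniforge may get confused when reading the HOME directory. More importantly, the build.sh script uses uname to detect the current platform, and it is in git-bash that uname outputs MINGW*.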

The micromamba link in miniforge's build.sh script may be dead (as of 2022.01.22). Download micromamba yourself and edit the script:

if [[ "${TARGET_PLATFORM}" != win-* ]]; then
    MICROMAMBA_VERSION=0.17.0
    mkdir "${TEMP_DIR}/micromamba"
    pushd "${TEMP_DIR}/micromamba"
    # Original download step, commented out because the link may be dead:
    #curl -L -O "https://anaconda.org/conda-forge/micromamba/${MICROMAMBA_VERSION}/download/${TARGET_PLATFORM}/micromamba-${MICROMAMBA_VERSION}-0.tar.bz2"
    # Copy the manually downloaded archive instead (adjust the path):
    cp "/path/to/download/micromamba-${MICROMAMBA_VERSION}-0.tar.bz2" ./
    tar -xf "micromamba-${MICROMAMBA_VERSION}-0.tar.bz2"

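Some variables are defined in miniforge's CI and have to be supplied by hand for a local build. For example, run export TARGET_PLATFORM="win-64" before executing build_miniforge_win.sh, otherwise the call to constructor may fail. Alternatively, edit build.sh and add the TARGET_PLATFORM variable near the top:

TARGET_PLATFORM="win-64"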

To build Mambaforge, the variable MINIFORGE_NAME="Mambaforge" is needed. Either add it to the environment with export MINIFORGE_NAME="Mambaforge" before running build_miniforge_win.sh, or edit Miniforge3/construct.yaml directly:

{% set name = os.environ.get("MINIFORGE_NAME", "Mambaforge") %}
# After changing this default, do not also set the MINIFORGE_NAME environment variable
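
As a rough illustration of supplying the CI-defined variables by hand, the sketch below builds an environment mapping that could be passed to the build script. The helper name build_env is hypothetical, and the commented-out launch line assumes the script is started from the miniforge checkout with bash available; neither comes from the miniforge project itself.

```python
import os


def build_env(base_env, target_platform="win-64", miniforge_name=None):
    """Return a copy of base_env with the variables the local build expects."""
    env = dict(base_env)
    env["TARGET_PLATFORM"] = target_platform
    # Only set MINIFORGE_NAME when asked, so an edited construct.yaml
    # default is not overridden by accident.
    if miniforge_name is not None:
        env["MINIFORGE_NAME"] = miniforge_name
    return env


env = build_env(os.environ, miniforge_name="Mambaforge")
print(env["TARGET_PLATFORM"], env["MINIFORGE_NAME"])

# Assumed way to launch the build with these variables (not run here):
# import subprocess
# subprocess.run(["bash", "build_miniforge_win.sh"], env=env, check=True)
```

Keeping the variables in a copied mapping leaves the current shell environment untouched, which makes it easier to switch between Miniforge3 and Mambaforge builds.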

A preconfigured GitHub Action really does save time and effort #(flops down)

Hiding Posts with R18 Content on the Home Page [NexT Theme]

Posted on 2022-01-10

Contains sensitive content; click to view.

Adding a live2d-widget Mascot to the NexT Theme

Posted on 2022-01-02

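I wanted to add a mascot (看板娘) to the blog. After a round of Baidu searching I found that hexo-helper-live2d, which many tutorials mention, is no longer updated, so I followed the links to live2d-widget and decided to use it.
Since NexT already bundles the font-awesome that live2d-widget needs, there is no need to add it again. Just add the following inside <head> in hexo-theme-next/layout/_layout.swig: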

<script src="https://cdn.jsdelivr.net/gh/stevenjoezhang/live2d-widget@latest/autoload.js"></script>

Done.

Hello World

Posted on 2022-01-01, updated on 2022-01-02

Welcome to Hexo! This is your very first post. Check documentation for more info. If you get any problems when using Hexo, you can find the answer in troubleshooting or you can ask me on GitHub.

Quick Start

Create a new post

$ hexo new "My New Post"

More info: Writing

Run server

$ hexo server

More info: Server

Generate static files

$ hexo generate

More info: Generating

Deploy to remote sites

$ hexo deploy

More info: Deployment