Over time, I developed a certain google-fu and expertise in finding references, papers, and books online. Some of these tricks are not well-known, like checking the Internet Archive (IA) for books. Here I write down my search workflow and give general advice about finding and hosting documents.

'Google-fu', or search skill, is something I've prided myself on ever since elementary school, when the librarian challenged the class to find things in the almanac; not infrequently, I'd win. The Internet is the greatest almanac of all, and to the curious a never-ending cornucopia, so it makes me sad to see so many fail to find things---or not look at all.

Below, I've tried to provide, in a roughly chronological way, a flowchart of an online search.

Search {#search}

Preparation {#preparation}

The first thing you must do is develop a habit of searching when you have a question: "Google is your friend". Your only search guaranteed to fail is the one you never run. (Beware trivial inconveniences!)

  1. Query syntax knowledge

Know your basic Boolean operators & the key G search operators: double quotes for exact matches, hyphens for negation/exclusion, and site: for searching a specific website or specific directory of that website (eg foo site:gwern.net/docs/genetics/). You may also want to play with Advanced Search to understand what is possible. (There are many more G search operators but they aren't necessarily worth learning, because they implement esoteric functionality and most seem to be buggy.)
  2. Hotkey acceleration (strongly recommended)

Enable some kind of hotkey search with both prompt and copy-paste selection buffer, to turn searching Google (G)/Google Scholar (GS)/Wikipedia (WP) into a reflex.^2^{#fnref2} You should be able to search instinctively within a split second of becoming curious, with a few keystrokes. (If you can't use it while IRCing without the other person noting your pauses, it's not fast enough.)

Example tools: AutoHotkey (Windows), Quicksilver (Mac), xclip+XMonad's search-engines/Actions.Search/Prompt.Shell (Linux). DuckDuckGo offers 'bangs', within-engine special searches (most are equivalent to a kind of Google site: search), which can be used similarly or combined with prompts/macros/hotkeys.

I make heavy use of the XMonad hotkeys, which I wrote, and which give me window manager shortcuts: while using any program, I can highlight a title string and press Super-shift-y to open the current selection as a GS search in a new Firefox tab within an instant; if I want to edit the title (perhaps to add an author surname, year, or keyword), I can instead open a prompt with Super-y, paste with C-y, and edit it before a \n launches the search. As can be imagined, this is extremely helpful for searching for many papers. (There are in-browser equivalents to these shortcuts, but I disfavor them because they only work if you are in the browser, typically require more keystrokes or mouse use, and don't usually support hotkeys or searching the copy-paste selection buffer: Firefox, Chrome.)
  3. Web browser hotkeys

For navigating between sets of results and entries, you should have good command of your tabbed web browser. You should be able to go to the address bar, move left/right in tabs, close tabs, open new blank tabs, go to a specific tab, etc. (In Firefox, respectively: C-l, C-PgUp, C-PgDwn, C-w, C-t, M-[1-9].)
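The xclip+prompt flow described in item 2 can be sketched as a small POSIX-shell script bound to a window-manager hotkey. The tool names (`xclip`, `firefox`) and the minimal URL-encoder are assumptions for illustration, not the XMonad code I actually use:

```shell
#!/bin/sh
# Sketch of a hotkey target: search Google Scholar for the current X selection.
# Bind this script to a key (eg Super-shift-y) in your window manager.

urlencode() {
    # Minimal percent-encoding via sed: covers the characters that appear
    # most often in paper titles (%, space, quote, colon, ampersand).
    sed -e 's/%/%25/g' -e 's/ /%20/g' -e 's/"/%22/g' -e 's/:/%3A/g' -e 's/&/%26/g'
}

gs_url() {
    # Build a Google Scholar query URL for a title string.
    printf 'https://scholar.google.com/scholar?q=%s' "$(printf '%s' "$1" | urlencode)"
}

# The actual hotkey action (commented out so the file can be sourced safely):
# firefox --new-tab "$(gs_url "$(xclip -o -selection primary)")"
```

The same skeleton works for any engine: swap in the WP or plain-G query URL and bind each to its own key.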

Searching {#searching}

Having launched your search in, presumably, GS, you must navigate the results.

In GS, remember that fulltext is not always denoted by a "[PDF]" link! Check the top hits by hand; there are often 'soft walls' which block web spiders but still let you download fulltext.

By Title {#by-title}

Title searches: if a paper's fulltext doesn't turn up on the first page, start tweaking (hard rules cannot be given for this; it requires developing "mechanical sympathy" and asking a mixture of "how would a machine think to classify this" and "how would other people think to write this"):

  • Keep in mind when searching that you want some, but not too many or too few, results. A few hundred hits in GS is around the sweet spot. If you have less than a page of hits, you have made your search too specific.

If deleting a few terms then yields far too many hits, try to filter out large classes of hits with a negation like foo -bar, adding as many as necessary; also useful are OR clauses, which open up the search in a more controlled way by adding in possible synonyms, with parentheses for grouping. This can get quite elaborate---I have on occasion resorted to search queries as baroque as (foo OR baz) AND (qux OR quux) -bar -garply -waldo -fred. (By this point, it may be time to consider alternate attacks.)

  • Tweak the title: quote the title; delete any subtitle; if there are colons, split it into two title quotes (instead of searching Foo bar: baz quux or "Foo bar: baz quux", search "Foo bar" "baz quux"); swap their order.

  • Add/remove the year.

  • Add/remove the first author.

  • Delete any unusual characters or punctuation. (Libgen had trouble with colons for a long time, and many websites still do.)

  • Use GS's date range to search ±4 years (metadata can be wrong, publishing conventions can be odd, publishers can be extremely slow). If a year is not specified, try to guess from the medium: popular media has a heavy recentist bias & prefers only contemporary research, while academic publications go back a few more years; the style of the reference can give a hint as to how old some mentioned research or writing is. Frequently, given the author surname and a reasonable guess at some research being a year or two old, the name + date-range in GS will be enough to find the paper.

  • Try alternate spellings of British/American terms. Try searching GS for just the author.

  • Add jargon which might be used by relevant papers; for example, if you are looking for an article on college admissions statistics, any such analysis would probably be using logistic regression and would express effects in terms of "odds".

If you don't know what jargon might be used, you may need to back off and look for a review article or textbook. Nothing is more frustrating than knowing there is a large literature on a topic ("Cowen's law") but being unable to find it because it's named something completely different than expected; many fields have different names for the same concept or tool.

  • beware hastily dismissing 'bibliographic' websites:

While a site like elibrary.ru is (almost) always useless & clutters up search results, every so often I run into a peculiar foreign website (often Indian or Chinese) which happens to have a scan of a book or paper. (eg Darlington 1954, which eluded me for well over half an hour until, taking the alternate approach of hunting for its volume, I out of desperation clicked on an Indian index/library website which... had it.) Sometimes you have to check every hit, just in case.
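Some of the purely mechanical title tweaks above (quoting the title, splitting at a colon, swapping the halves) can be scripted as a query-variant generator to try by hand; a sketch in POSIX shell (the function name is hypothetical):

```shell
# Print the quoted-title query variants for a given title:
# the exact quote, and (if there is a colon) the split and swapped forms.

title_variants() {
    t=$1
    printf '"%s"\n' "$t"                           # exact quoted title
    case $t in
        *:*)
            head=${t%%:*}                          # part before the colon
            tail=${t#*:}; tail=${tail# }           # part after, sans leading space
            printf '"%s" "%s"\n' "$head" "$tail"   # split into two quotes
            printf '"%s" "%s"\n' "$tail" "$head"   # swapped order
            ;;
    esac
}

# title_variants "Foo bar: baz quux"
# → "Foo bar: baz quux"
# → "Foo bar" "baz quux"
# → "baz quux" "Foo bar"
```

Year and author variants still have to be appended by hand, since they depend on judgment about the citation.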

Hard Cases {#hard-cases}

If the basic tricks aren't giving any hints of working, you will have to get serious. The title may be completely wrong, or it may be indexed under a different author, or not directly indexed at all, or hidden inside a database. Here are some indirect approaches to finding articles:

  • Take a look in GS's "related articles" or "cited by" to find similar articles such as later versions of a paper which may be useful. (These are also good features to know about if you want to check things like "has this ever been replicated?" or are still figuring out the right jargon to search.)

  • Look for hints of hidden bibliographic connections. Does a paper pop up high in the search results which doesn't seem to make sense? GS generally penalizes items which exist as simply bibliographic entries, so if one is ranked high in a sea of fulltexts, that should make you wonder why it is being prioritized. Similarly for Google Books (GB): a book might be forbidden from even snippets but rank high; that might be for a good reason.

  • some papers can be found indirectly by searching for the volume or book title, especially conference proceedings or anthologies; many papers appear to not be available online but are merely buried inside a 500-page PDF, and the G snippet listing is misleading.

Conferences are particularly complex bibliographically, so you may need to apply the same tricks as for paper titles: drop parts, don't fixate on the numbers, know that the authors or ISBN or ordering of "title:subtitle" can differ between sources, etc.

  • Another approach is to look up the listing for a journal issue, and find the paper by hand; sometimes papers are listed in the journal issue's online Table of Contents, but just don't appear in search engines. In particularly insidious cases, a paper may be digitized & available---but lumped in with another paper due to error, or only as part of a catch-all file which contains the last 20 miscellaneous pages of an issue. Page range citations are particularly helpful here because they show where the overlap is, so you can download the suspicious overlapping 'papers' to see what they really contain.

Esoteric as this may sound, this has been a problem on multiple occasions. (A particularly epic example was Shockley 1966, where after an hour of hunting, all I had was bibliographic echoes despite it apparently being published in a high-profile, easily obtained, & definitely digitized journal, Science, leaving me thoroughly puzzled. I eventually looked up the ToC and inferred it had been hidden in a set of abstracts! Or a number of SMPY papers turned out to be split or merged with neighboring items in journal issues, and I had to fix them.)

  • master's/PhD theses: sorry. It may be hopeless if it's pre-2000. You may well find the citation and even an abstract, but actual fulltext...? If you have a university proxy, you may be able to get a copy off ProQuest. Otherwise, you need full university ILL services^3^{#fnref3}, and even that might not be enough (a surprising number of universities appear to restrict access to only their own students/faculty, with the complicating factor that most theses are stored on microfilm).

  • if images are involved, a reverse image search in Google Images or TinEye can turn up important leads.

  • domain knowledge:

  • US federal court documents can be downloaded off PACER after registration; it is pay-per-page, but users under a certain spending level each quarter have their fees waived. There is a public mirror, called RECAP, which can be searched & downloaded from for free. If you fail to find a case in RECAP and must use PACER, please install the Firefox/Chrome browser extension, which will copy anything you download into RECAP. (This can be handy if you realize later that you should've kept a long PDF you downloaded or want to double-check a docket.)

There is no equivalent for state or county court systems, which are balkanized and use a thousand different systems (often privatized & charging far more than PACER); those must be handled on a case-by-case basis. (Interesting trivia point: according to Nick Bilton's account of the Silk Road 1 case, the FBI and other federal agencies in the SR1 investigation would deliberately steer cases into state rather than federal courts in order to hide them from the relative transparency of the PACER system. The use of multiple court systems can backfire on them, however, as in the case of SR2's DoctorClu (see [the DNM arrest census](https://www.gwern.net/DNM-arrests) for details), where the local police filings revealed the use of hacking techniques to deanonymize SR2 Tor users, implicating CMU's CERT center---details which were belatedly scrubbed from the PACER filings.)
  • for charity financial filings, do Form 990 site:charity.com and then check GuideStar (eg "Case Study: Reading Edge's financial filings")

  • for anything related to education, do a site search of ERIC, which is similar to IA in that it will often have fulltext which is buried in the usual search results

By Quote or Description {#by-quote-or-description}

For quote/description searches: if you don't have a title and are falling back on searching quotes, try varying your search similarly to titles:

  • Try the easy search first.
  • Don't search too long a quote; a sentence or two is usually enough, and can be helpful in turning up other sources quoting different chunks which may have better citations.
  • Try multiple sub-quotes from a big quote, especially from the beginning and end, which are likely to overlap with quotes which have prior or subsequent passages.
  • Look for passages in the original text which seem like they might be based on the same source, particularly if they are simply dropped in without any hint at sourcing; authors typically don't cite every time they draw on a source, usually only the first time, and during editing the 'first' appearance of a source could easily have been moved to later in the text. All of these additional uses are something to add to your searches.
  • You are fighting a game of Chinese whispers, so look for unique-sounding sentences and terms which can survive garbling in the repeated transmissions. Avoid phrases which could be easily reworded in multiple equivalent ways, as people usually will reword them when quoting from memory, screwing up literal searches.
  • Watch out for punctuation and spelling differences hiding hits.
  • Search for oddly-specific phrases or words, especially numbers. 3 or 4 keywords is usually enough.
  • Longer, less witty versions are usually closer to the original and a sign you are on the right trail.
  • Switch to GB and hope someone paraphrases or quotes it, and includes a real citation; if you can't see the full passage or the reference section, look up the book in Libgen.
Dealing With Paywalls {#dealing-with-paywalls}

A paywall can usually be bypassed by using Libgen (LG)/Sci-Hub (SH): papers can be searched directly (ideally with the DOI, but title+author with no quotes will usually work), or an easier way may be to prepend sci-hub.tw (or whatever SH mirror you prefer) to the URL of a paywalled paper.
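The prepending trick can be captured in a one-line helper; sci-hub.tw is the mirror named above, but mirrors change often, so it is left as a parameter (the function name is illustrative):

```shell
# Build an SH mirror URL from a paywalled URL or a bare DOI.
SH_MIRROR=${SH_MIRROR:-sci-hub.tw}

sh_url() {
    printf 'https://%s/%s' "$SH_MIRROR" "$1"
}

# sh_url 'https://www.sciencedirect.com/science/article/pii/...'  # paywalled URL
# sh_url '10.1112/plms/s2-30.1.264'                               # bare DOI
```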

If those don't work and you do not have a university proxy or alumni access: many university libraries have IP-based access rules, and also open WiFi or Internet-capable computers with public logins inside the library, which can be used for their databases if you are willing to take the time to visit a university in person (it is probably a good idea to compile a list of needed items before paying a visit).

If that doesn't work, there is a more opaque ecosystem of filesharing services: booksc/bookfi/bookzz, private torrent trackers like Bibliotik, IRC channels with XDCC bots like #bookz/#ebooks, old P2P networks like eMule, private DC++ hubs...

Site-specific notes:

  • Elsevier/sciencedirect.com: easy, always available via SH/LG

Note that many Elsevier journal websites do not work with the SH proxy, although their sciencedirect.com version does and/or the paper is already in LG. If you see a link to sciencedirect.com on a paywall, try it if SH fails on the journal website itself.

  • PsycNET: one of the worst sites; SH/LG never work with the URL method, rarely work with paper titles/DOIs, and with my university library proxy, combined searches don't usually work (frequently failing to pull up even bibliographic entries), and only DOI or manual title searches in the EBSCOhost database have a chance of fulltext. (EBSCOhost itself is a fragile search engine which is difficult to query reliably in the absence of a DOI.) Try to find the paper anywhere else besides PsycNET!

Request {#request}

Last resort: if none of this works, there are a few places online you can request a copy (however, they will usually fail if you have exhausted all previous avenues):

Finally, you can always try to contact the author. This only occasionally works for the papers I have the hardest time with, since they tend to be old ones where the author is dead or unreachable---any paper published since 1990 will usually have been digitized somewhere---but it's easy to try.

Post-finding {#post-finding}

After finding a fulltext copy, you should find a reliable long-term link/place to store it and make it more findable:

  • never link LG/SH:

Always operate under the assumption they could be gone tomorrow. (As indeed my uncle found out with Library.nu shortly after paying for a lifetime membership.) There are no guarantees either one will be around for long under their legal assaults, and no guarantee that they are being properly mirrored or will be restored elsewhere. Download anything you need and keep a copy of it yourself and, ideally, host it publicly.

  • never rely on a papers.nber.org/tmp/ or psycnet.apa.org URL, as they are temporary

  • never link Scribd: it is a scummy website which impedes downloads, and anything on Scribd usually first appeared elsewhere anyway.

  • avoid linking to ResearchGate (compromised by investment & PDFs get deleted routinely, apparently often by authors) or Academia.edu (the URLs are one-time and break)

  • be careful linking to Nature.com (if a paper is not explicitly marked as Open Access, even if it's available, it may disappear in a few months!); similarly, watch out for tandfonline.com, jstor.org, springer.com, springerlink.com, & mendeley.com

  • be careful linking to academic personal directories on university websites (often noticeable by the Unix convention .edu/~user/); they have short half-lives.

  • check & improve metadata.

Adding metadata to papers/books is a good idea because it makes the file findable in G/GS (if it's not online, does it really exist?) and helps you if you decide to use bibliographic software like Zotero in the future. Many academic publishers & LG are terrible about metadata, and will not include even title/author/DOI/year. PDFs can be easily annotated with metadata using ExifTool: exiftool -All prints all metadata, and the metadata can be set individually using similar fields.

For papers hidden inside volumes or other files, you should extract the relevant page range to create a single standalone file. (For extraction of PDF page-ranges, I use pdftk, eg: pdftk 2010-davidson-wellplayed10-videogamesvaluemeaning.pdf cat 180-196 output 2009-fortugno.pdf.)

I try to set at least title/author/DOI/year/subject, and stuff any additional topics & bibliographic information into the "Keywords" field. Example of setting metadata:

  exiftool -Author="Frank P. Ramsey" -Date=1930 -Title="On a Problem of Formal Logic" -DOI="10.1112/plms/s2-30.1.264" \
      -Subject="mathematics" -Keywords="Ramsey theory, Ramsey's theorem, combinatorics, mathematical logic, decidability, \
      first-order logic, Bernays-Schönfinkel-Ramsey class of first-order logic, _Proceedings of the London Mathematical \
      Society_, Volume s2-30, Issue 1, 1 January 1930, pg264-286" 1930-ramsey.pdf
  • if a scan, it may be worth editing the PDF to crop the edges, thresholding to binarize it (which, for a bad grayscale or color scan, can drastically reduce filesize while increasing readability), and OCRing it. I use gscan2pdf but there are alternatives worth checking out.

  • if possible, host a public copy; especially if it was very difficult to find, even if it was useless, it should be hosted. The life you save may be your own.

  • for bonus points, link it in appropriate places on Wikipedia
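The extract-then-tag steps above can be wrapped into one helper, reusing the pdftk & exiftool invocations already shown; a sketch (the function names are hypothetical, and the `YEAR-surname.pdf` naming scheme just mirrors the examples):

```shell
# paper_filename: derive a YEAR-surname.pdf filename, as in `1930-ramsey.pdf`.
paper_filename() {
    year=$1; author=$2
    surname=$(printf '%s' "$author" | awk '{print tolower($NF)}')
    printf '%s-%s.pdf' "$year" "$surname"
}

# extract_and_tag: pull a page range out of an anthology PDF and set its
# metadata in one go (requires pdftk & exiftool, so it is not run here).
extract_and_tag() {
    src=$1; range=$2; year=$3; author=$4; title=$5
    out=$(paper_filename "$year" "$author")
    pdftk "$src" cat "$range" output "$out"
    exiftool -Author="$author" -Date="$year" -Title="$title" "$out"
}

# extract_and_tag anthology.pdf 180-196 2009 "A. Author" "Paper Title"
```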

Advanced {#advanced}

Aside from the highly-recommended use of hotkeys and Booleans for searches, there are a few useful tools for the researcher, which while expensive initially, can pay off in the long-term:

  • archiver-bot: automatically archive your web browsing and/or links from arbitrary websites to forestall linkrot; particularly useful for detecting & recovering from dead PDF links

  • PubMed & GS search alerts: set up alerts for a specific search query, or for new citations of a specific paper. (Google Alerts is not as useful as it seems.)

  1. PubMed has straightforward conversion of search queries into alerts: "Create alert" below the search bar. (Given the volume of PubMed indexing, I recommend carefully tailoring your search to be as narrow as possible, or else your alerts may overwhelm you.)
  2. To create generic GS search query alert, simply use the "Create alert" on the sidebar for any search. To follow citations of a key paper, you must: 1. bring up the paper in GS; 2. click on "Cited by X"; 3. then use "Create alert" on the sidebar.
  • Google Custom Search Engines (a GCSE is a specialized search engine limited to whitelisted pages/domains etc; eg my Wikipedia-focused anime/manga CSE. If you find yourself regularly including many domains in a search, blacklisting domains with -site:, or using many negations to filter out common false positives, it may be time to set up a GCSE.)

  • Clipping/note-taking services like Evernote/Microsoft OneNote: regularly making and keeping excerpts creates a personalized search engine, in effect.

This can be vital for refinding old things you read where the search terms are hopelessly generic or you can't remember an exact quote or reference; it is one thing to search for a keyword like "autism" in a few score thousand clippings, and another thing to search for it in the entire Internet! (One can also reorganize or edit the notes to add the keywords one is thinking of, to help with refinding.) I make heavy use of Evernote clipping and it is key to refinding my references.

Useful tools to know about: cURL, HTTrack; Firefox plugins: NoScript, uBlock origin, Live HTTP Headers, Bypass Paywalls, cookie exporting. Short of downloading a website, it might also be useful to pre-emptively archive it by using linkchecker to crawl it, compile a list of all external & internal links, and store them for processing by another archival program (see Archiving URLs for examples).

With proper use of pre-emptive archiving tools like archiver-bot, fixing linkrot in one's own pages is much easier, but that leaves other references. Searching for lost web pages is similar to searching for papers:

  • if the page title is given, search for the title.

It is a good idea to include page titles in one's own pages, as well as the URL, to help with future searches, since the URL may be meaningless gibberish on its own, and pre-emptive archiving can fail. HTML supports both alt and title attributes in link tags, and, in cases where displaying a title is not desirable (because the link is being used inline as part of normal hypertextual writing), titles can be included cleanly in Markdown documents like this: [inline text description](URL "Title").

  • check the URL: is it weird or filled with trailing garbage like ?rss=1 or ?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+blogspot%2FgJZg+%28Google+AI+Blog%29? Or a variant domain, like a foo.com/amp/ URL?

  • restrict G search to the original domain with site:, or to related domains

  • restrict G search to the original date-range/years

  • try a different search engine: corpuses can vary, and in some cases G tries to be too smart for its own good when you need a literal search; DuckDuckGo and Bing are usable alternatives (especially if one of DuckDuckGo's 'bang' special searches is what one needs)

  • if nowhere on the clearnet, try the Internet Archive (IA) or the Memento meta-archive search engine:

IA is the default backup for a dead URL. If IA doesn't Just Work, there may be other versions in it:

  • did the IA 'redirect' you to an error page? Kill the redirect and check the earliest stored version. Did the page initially load but then error out/redirect? Disable JS with NoScript and reload.

  • IA lets you list all URLs with any archived versions, by searching for URL/*; the list of available URLs may reveal an alternate newer/older URL. It can also be useful to filter by filetype or substring: for example, one might list all URLs in a domain, and if the list is too long and filled with garbage URLs, then use the "Filter results" incremental-search widget to search for "uploads/" on a WordPress blog.^4^{#fnref4}

![Screenshot of an oft-overlooked feature of the Internet Archive: displaying all available/archived URLs for a specific domain, filtered down to a subset matching a string like uploads/.](https://www.gwern.net/images/2019-internetarchive-domainsearch-screenshot.png)
  • [`wayback_machine_downloader`](https://github.com/hartator/wayback-machine-downloader) (not to be confused with the [`internetarchive` Python package](https://github.com/jjjake/internetarchive) which provides a CLI interface to uploading files) is a Ruby tool which lets you download whole domains from IA, which can be useful for running a local fulltext search using regexps (a good `grep` query is often enough), in cases where just looking at the URLs via `URL/*` is not helpful. (An alternative which might work is [websitedownloader.io](https://websitedownloader.io "Wayback Machine Downloader: Download the source code and assets from Wayback Machine").)

  • did the domain change, eg from www.foo.com to foo.com or www.foo.org? Entirely different as far as IA is concerned.

  • is this a Blogspot blog? Blogspot is uniquely horrible in that it has versions of each blog for every country domain: a foo.blogspot.com blog could be under any of foo.blogspot.au, foo.blogspot.hk, foo.blogspot.jp...

  • did the website provide RSS feeds?

A little-known fact is that [Google Reader](https://en.wikipedia.org/wiki/Google%20Reader "Wikipedia: Google Reader") (GR; October 2005-July 2013) stored all RSS items it crawled, so if a website's RSS feed was configured to include full items, the RSS feed history was an alternate mirror of the whole website, and since GR never removed RSS items, it was possible to retrieve pages or whole websites from it. GR has since closed down, sadly, but before it closed, [Archive Team](https://en.wikipedia.org/wiki/Archive%20Team "Wikipedia: Archive Team") [downloaded](https://www.archiveteam.org/index.php/Google_Reader) a large fraction of GR's historical RSS feeds, and *those* archives are [now hosted on IA](https://archive.org/details/archiveteam_greader). The catch is that they are stored in mega-[WARCs](https://en.wikipedia.org/wiki/Web%20ARChive "Wikipedia: Web ARChive"), which, for all their archival virtues, are not the most user-friendly format. The raw GR mega-WARCs are difficult enough to work with that I [defer an example to the appendix](#searching-the-google-reader-archives).
  • archive.today: an IA-like mirror

  • any local archives, such as those made with my own archiving tools

  • Google Cache (GC): GC works, sometimes, but the copies are usually the worst around, ephemeral & cannot be relied upon. Google also appears to have been steadily deprecating GC over the years, as GC shows up less & less in search results.

Digital {#digital}

E-books are rarer and harder to get than papers, although the situation has improved vastly since the early 2000s. To search for books online:

  • book searches tend to be faster and simpler than paper searches, and to require less cleverness in search query formulation. Typically, if the main title + author doesn't turn it up, it's not online. (In some cases, the author order is reversed, or the title:subtitle are reversed, and you can find a copy by tweaking your search, but these are rare.)

  • search G for the title (book fulltexts usually don't show up in GS); to double-check, you can try a filetype:pdf search

  • then check LG

  • the Internet Archive (IA/archive.org) has many books scanned which do not appear easily in search results.

  • If an IA hit pops up in a search, always check it; the OCR may offer hints as to where to find it. If you don't find anything at the provided link, try doing an IA site search in G (not the IA built-in search engine), eg book title site:archive.org.

  • if it is on IA but the IA version is DRMed and is only available for "checkout", you can jailbreak it by downloading the PDF version with Adobe Digital Editions <=4.0, which can be run in Wine, and then importing it into Calibre with the De-DRM plugin, which will produce a DRM-free PDF inside Calibre's library. (Getting De-DRM running can be tricky, especially under Linux. I wound up having to edit some of the paths in the Python files to make them work with Wine.) You can then add metadata to the PDF & upload it to LG^6^{#fnref6}. (LG's versions of books are usually better than the IA scans, but if they don't exist, IA's is better than nothing.)

  • Google Play: uses the same PDF DRM as IA, and can be broken the same way

  • HathiTrust also hosts many book scans, which can be searched for clues or hints or jailbroken.

HathiTrust blocks whole-book downloads but it's easy to download each page in a loop and stitch them together, for example:
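A minimal sketch of such a loop (the URL schema, document id, and page count here are illustrative placeholders---inspect the page-image requests in your browser's network tab to get the real ones for your book):

```shell
id="uc1.31158011312528"    # placeholder HathiTrust document id
pages=3                    # placeholder page count, from the HathiTrust viewer

hathi_url() {   # print the hypothetical scan-image URL for page $1
    echo "https://babel.hathitrust.org/cgi/imgsrv/image?id=${id};seq=$1"
}

# generate the full URL list, one page per line:
for i in $(seq 1 "$pages"); do hathi_url "$i"; done > urls.txt
# then: wget --wait=1 -i urls.txt    # polite one-page-at-a-time download
# and:  img2pdf *.jpg -o book.pdf    # stitch the pages back into a PDF
```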

Another example of this would be the Wellcome Library; while looking for *An investigation into the relation between intelligence and inheritance*, Lawrence 1931, I came up dry until I checked one of the last search results, a "Wellcome Digital Library" hit, on the slim off-chance that, like the occasional Chinese/Indian library website, it just might have fulltext. As it happens, it did---good news? Yes, but with a caveat: it provides *no* way to download the book! It provides OCR, metadata, and individual page-image downloads all under CC-BY-NC-SA (so no legal problems), but... not the book. (The OCR is also unnecessarily zipped, so that is why Google ranked the page so low and did not show any revealing excerpts from the OCR transcript: because it's hidden in an opaque archive to save a few kilobytes while destroying SEO.) Examining the download URLs for the highest-resolution images, they follow an unfortunate schema:

  1. https://dlcs.io/iiif-img/wellcome/1/5c27d7de-6d55-473c-b3b2-6c74ac7a04c6/full/2212,/0/default.jpg
  2. https://dlcs.io/iiif-img/wellcome/1/d514271c-b290-4ae8-bed7-fd30fb14d59e/full/2212,/0/default.jpg
  3. etc

Instead of being sequentially numbered 1--90 or whatever, they all live under a unique hash or ID. Fortunately, one of the metadata files, the 'manifest' file, provides all of the hashes/IDs (but not the high-quality download URLs). Extracting the IDs from the manifest can be done with some quick sed & tr string processing, and fed into another short wget loop for download.
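A sketch of that processing (the sed pattern assumes the IDs appear in the manifest as dlcs.io/iiif-img/wellcome/1/<hash> strings, matching the URL schema above):

```shell
extract_ids() {   # print one image hash/ID per line from an IIIF manifest
    tr ',' '\n' < "$1" \
      | sed -n 's|.*iiif-img/wellcome/1/\([0-9a-f-]*\).*|\1|p' \
      | sort -u
}

# feed each ID into a wget loop, requesting the highest-resolution image:
extract_ids manifest.json | while read -r id; do
    wget -q "https://dlcs.io/iiif-img/wellcome/1/${id}/full/2212,/0/default.jpg" \
         -O "${id}.jpg"
done
```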

And then the 59MB of JPGs can be cleaned up as usual with gscan2pdf (empty pages deleted, tables rotated, cover page cropped, all other pages binarized), compressed/OCRed with ocrmypdf, and metadata set with exiftool, producing a readable, downloadable, highly-search-engine-friendly 1.8MB PDF.

  • ebook.farm is a Kindle pirate website which takes Amazon gift-cards as currency; it has many recent e-books which are DRM-free and can be uploaded to LG.

  • remember the analog hole works for papers/books too:

if you can find a copy to *read*, but cannot figure out how to *download* it directly because the site uses JS or complicated cookie authentication or other tricks, you can always exploit the 'analogue hole'---fullscreen the book in high resolution & take screenshots of every page; then crop, OCR etc. This is tedious but it works. And if you take screenshots at sufficiently high resolution, there will be relatively little quality loss. (This works better for books that are scans than ones born-digital.)
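The screenshot loop itself is easily automated on X11; a sketch, assuming ImageMagick's `import` and `xdotool` are installed, and with placeholder values for the page count and page-turn key of your particular e-book reader:

```shell
pages=5                 # placeholder: the book's page count

page_file() {   # zero-padded filename for page $1, so pages sort correctly
    printf 'page_%03d.png' "$1"
}

# only attempt capture when a display & the tools are actually available:
if [ -n "${DISPLAY:-}" ] && command -v import >/dev/null && command -v xdotool >/dev/null; then
    for i in $(seq 1 "$pages"); do
        import -window root "$(page_file "$i")"   # screenshot the fullscreened page
        xdotool key Next                          # PgDn to the next page
        sleep 1                                   # give the reader time to re-render
    done
fi
# then crop the screenshots, binarize, and OCR as with any other scan
```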

Physical {#physical}

Books are something of a double-edged sword compared to papers/theses. On the one hand, books are much more often unavailable online, and must be bought offline, but at least you almost always *can* buy used books offline without much trouble (and often for <$10 total); on the other hand, while papers/theses are often available online, when one is not available, it's usually *very* unavailable, and you're stuck (unless you have a university ILL department backing you up or are willing to travel to the few or only universities with paper or microfilm copies).

Purchasing from used book sellers:

  • Google Books is a good starting point for seller links; if buying from a marketplace like AbeBooks/Amazon/Barnes & Noble, it's worth searching the seller to see if they have their own website, which is potentially much cheaper. They may also have multiple editions in stock.

  • Sellers:

  • bad: eBay & Amazon, due to high-minimum-order+S&H, but can be useful in providing metadata like page count or ISBN or variations on the title

  • good: AbeBooks, Thrift Books, Better World Books, B&N, Discover Books.

Note: on AbeBooks, international orders can be useful (especially for behavioral genetics or psychology books) but be careful of international orders with your credit card---many debit/credit cards will fail and trigger a fraud alert, and PayPal is not accepted.  
  • if a book is not available or too expensive, set price watches: AbeBooks supports email alerts on stored searches, and Amazon can be monitored via CamelCamelCamel (remember the CCC price alert you want is on the *used* third-party category, as new books are more expensive, less available, and unnecessary).


  • destructive vs non-destructive: destructively debinding books with a razor or guillotine cutter works much better & is much less time-consuming than spreading them on a flatbed scanner to scan one-by-one^7^{#fnref7}, because it allows use of a sheet-fed scanner instead, which is easily 5x faster and will give higher-quality scans (because the sheets will be flat, scanned edge-to-edge, and much more closely aligned).

  • Tools:

  • For simple debinding of a few books a year, an X-acto knife/razor is good (avoid the 'triangle' blades, get curved blades intended for large cuts instead of detail work)

  • once you start doing more than one a month, it's time to upgrade to a guillotine blade paper cutter (a fancier swinging-arm paper cutter, which uses a two-joint system to clamp down and cut uniformly).

A guillotine blade can cut chunks of 200 pages easily without much slippage, so for books with more pages, I use both: an X-acto to cut along the spine and turn it into several 200-page chunks for the guillotine cutter.  
  • at some point, it may make sense to switch to a scanning service like 1DollarScan (1DS has acceptable quality for the black-white scans I have used them for thus far, but watch out for their nickel-and-diming fees for OCR or "setting the PDF title"; these can be done in no time yourself using exiftool/ocrmypdf and will save a *lot* of money as they, amazingly, bill by 100-page units). Books can be sent directly to 1DS, reducing logistical hassles.

  • after scanning, crop/threshold/OCR/add metadata

  • Adding metadata: same principles as papers. While more elaborate metadata can be added, like bookmarks, I have not experimented with those yet.

  • Saving files:

In the past, I used DjVu for documents I produce myself, as it produces much smaller scans than gscan2pdf's default PDF settings (which are bloated due to a buggy Perl library)---at least half the size, sometimes one-tenth the size---making them more easily hosted & a superior browsing experience.

The downsides of DjVu are that not all PDF viewers can handle DjVu files, and it appears that G/GS ignore all DjVu files (despite the format being 20 years old), rendering them completely unfindable online. In addition, DjVu is an increasingly obscure format and has, for example, been dropped by the IA as of 2016. The former is a relatively small issue, but the latter is fatal---being consigned to oblivion by search engines largely defeats the point of scanning! ("If it's not in Google, it doesn't exist.") Hence, despite being a worse format, I now recommend PDF, have stopped using DjVu for new scans^8^{#fnref8}, and have converted my old DjVu files to PDF.

  • Uploading: to LibGen, usually. For backups, filelockers like Dropbox, Mega, MediaFire, or Google Drive are good. I usually upload 3 copies including LG. I rotate accounts once a year, to avoid putting too many files into a single account.

  • Hosting: hosting papers is easy but books come with risk:

Books can be dangerous; in deciding whether to host a book, my rule of thumb is to host only books published pre-2000 which do not have Kindle editions or other signs of active exploitation, and so are effectively 'orphan works'.

As of 11 December 2018, hosting 3763 files over 8 years (very roughly, assuming linear growth, <5.5 million document-days of hosting: 3763⋅0.5⋅8⋅365.25 = 5,497,743), I've received 3 takedown orders: a behavioral genetics textbook (2013), *The Handbook of Psychopathy* (2005), and a recent meta-analysis paper (Roberts et al 2016). I broke my rule of thumb to host the 2 books (my mistake), which leaves only the 1 paper, which I think was a fluke. So, as long as one avoids relatively recent books, the risk should be minimal.

Searching the Google Reader archives {#searching-the-google-reader-archives}

One way to 'undelete' a blog or website is to use Google Reader (GR).

GR crawled almost all blogs' RSS feeds regularly, and RSS feeds often contain the fulltext of articles. If a blog author wrote an article, the fulltext was included in the RSS feed and GR downloaded it; if the author then changed their mind and edited or deleted it, GR would redownload the new version, but it would continue to show the old version as well (you would see both versions, chronologically). If the author blogged regularly and so GR had learned to check regularly, it could hypothetically grab many different edited versions, not just ones separated by weeks or months. That is, assuming GR did not, as it sometimes did for inscrutable reasons, stop displaying the historical archives and show only the last 90 days or so to readers; I was never able to figure out why this happened, or whether it really happened at all rather than being some sort of UI problem. Regardless, if all went well, this let you undelete an article, albeit perhaps with messed-up formatting. Sadly, GR was closed back in 2013 and you cannot simply log in and look for blogs.

However, before it was closed, Archive Team launched a major effort to download as much of GR as possible. So in that dump, there may be archives of all of a random blog's posts. Specifically: if a GR user subscribed to it; if Archive Team knew about it; if they requested it in time before closure; and if GR did keep full archives stretching back to the first posting.

Downside: the Archive Team dump is *not* in an easily browsed format, and merely figuring out what it *might* contain is difficult. In fact, it's so difficult that before researching Craig Wright in November--December 2015, I never had an urgent enough reason to figure out how to get anything out of it, and I'm not sure I've ever seen anyone actually use it; Archive Team takes the attitude that it's better to preserve the data somehow and let posterity worry about *using* it. (There is a site which claimed to be a frontend to the dump, but when I tried to use it, it was broken & still was as of December 2018.)

Results {#results}

My dd extraction was successful, and the resulting HTML/RSS could then be browsed with a command like cat *.warc | fold --spaces --width=200 | less. They can probably also be converted to a local form and browsed, although they won't include any of the site assets like images or CSS/JS, since the original RSS feed assumes you can load any references from the original website and didn't do any kind of data-URI inlining or mirroring (not, after all, having been intended for archival purposes in the first place...)