jesusjsc

Loading the 20newsgroups dataset offline with sklearn
2019/04/02

Environment: Ubuntu 16.04 + Anaconda2 + sklearn 0.17.1


When loading the fetch_20newsgroups dataset online with sklearn:

from sklearn.datasets import fetch_20newsgroups

news = fetch_20newsgroups(subset='all')

various errors can easily occur, such as:

No handlers could be found for logger "sklearn.datasets.twenty_newsgroups"

So instead, download the dataset offline first, then modify the source code so that the dataset is loaded offline.

The steps are as follows:

  1. Create the directory: ~/scikit_learn_data/20news_home
  2. Download the 20news-bydate.tar.gz dataset from the Home Page for 20 Newsgroups Data Set and place it in the directory created in step 1.
  3. Find the twenty_newsgroups.py file in the installed sklearn path, e.g.: ~/anaconda2/lib/python2.7/site-packages/sklearn/datasets/twenty_newsgroups.py
  4. Open that file for editing and comment out part of the code, as shown below:

    def download_20newsgroups(target_dir, cache_path):
        """Download the 20 newsgroups data and stored it as a zipped pickle."""
        archive_path = os.path.join(target_dir, ARCHIVE_NAME)
        train_path = os.path.join(target_dir, TRAIN_FOLDER)
        test_path = os.path.join(target_dir, TEST_FOLDER)
    
        if not os.path.exists(target_dir):
            os.makedirs(target_dir)
    
        # if os.path.exists(archive_path):
        #     # Download is not complete as the .tar.gz file is removed after
        #     # download.
        #     logger.warning("Download was incomplete, downloading again.")
        #     os.remove(archive_path)
        # 
        # logger.warning("Downloading dataset from %s (14 MB)", URL)
        # opener = urlopen(URL)
        # with open(archive_path, 'wb') as f:
        #     f.write(opener.read())
    
        logger.info("Decompressing %s", archive_path)
        tarfile.open(archive_path, "r:gz").extractall(path=target_dir)
        os.remove(archive_path)
    
        # Store a zipped pickle
        cache = dict(train=load_files(train_path, encoding='latin1'),
                     test=load_files(test_path, encoding='latin1'))
        compressed_content = codecs.encode(pickle.dumps(cache), 'zlib_codec')
        with open(cache_path, 'wb') as f:
            f.write(compressed_content)
    
        shutil.rmtree(target_dir)
        return cache
  5. That's it; running fetch_20newsgroups again will now load the dataset successfully. After it finishes, you will find a new 20news-bydate.pkz file under ~/scikit_learn_data, and subsequent loads of the dataset will read it directly.
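As the code in step 4 shows, the 20news-bydate.pkz cache is simply a zlib-compressed pickle. A small sketch of the same write/read roundtrip, using a toy dict in place of the real train/test data (the filename here is illustrative):

```python
import codecs
import pickle

# Toy stand-in for the real cache dict of train/test bunches.
cache = dict(train=["sample document"], test=["another document"])

# Write: pickle, then zlib-compress -- what download_20newsgroups does.
compressed = codecs.encode(pickle.dumps(cache), 'zlib_codec')
with open('toy-cache.pkz', 'wb') as f:
    f.write(compressed)

# Read: decompress, then unpickle -- what fetch_20newsgroups does on reload.
with open('toy-cache.pkz', 'rb') as f:
    restored = pickle.loads(codecs.decode(f.read(), 'zlib_codec'))

print(restored['train'][0])  # -> sample document
```

This is why subsequent calls to fetch_20newsgroups are fast: they skip both the download and the decompression of the .tar.gz archive and just unpickle the cache.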
Last modification: May 27th, 2019 at 05:19 pm
