Asked by: 小点点

Removing elements from a 300 MB XML file in Python/ElementTree


I am trying to parse a 300 MB XML file with ElementTree, following the advice in "Python xml ElementTree: parsing a very large XML file?"

from xml.etree import ElementTree as Et

for event, elem in Et.iterparse('C:\...path...\desc2015.xml'):  
    if elem.tag == 'DescriptorRecord':
        for e in elem._children:
            if str(e.tag) in ['DateCreated', 'Year', 'Month', 'TreeNumber', 'HistoryNote', 'PreviousIndexing']:
                e.clear()
                elem.remove(e)
                print 'removed %s' % e

This gives:

removed <Element 'HistoryNote' at 0x557cc7f0>
removed <Element 'DateCreated' at 0x557fa990>
removed <Element 'HistoryNote' at 0x55809af0>
removed <Element 'DateCreated' at 0x5580f5d0>

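However, the script just keeps running, the file does not get any smaller, and when I check it the elements are still there. I tried e.clear() and elem.delete(e), with the same result. Regards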

The faulty code from my first comment on @alexanderlukanin13's answer:


1 answer

Anonymous user

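The main problem with your script is that you never save the modified XML back to disk. You need to keep a reference to the root element and then call ElementTree.write: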

from xml.etree import ElementTree as Et

context = Et.iterparse('input.xml')
root = None
for event, elem in context:
    if elem.tag == 'DescriptorRecord':
        for e in list(elem):  # Iterate over a copy of the children; don't use _children (private), and getchildren() was removed in Python 3.9
            if e.tag in ['DateCreated', 'Year', 'Month', 'TreeNumber', 'HistoryNote', 'PreviousIndexing']:
                elem.remove(e)  # You need remove(), not clear()
    root = elem

with open('output.xml', 'wb') as file:
    Et.ElementTree(root).write(file, encoding='utf-8', xml_declaration=True)

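Note: here I use an awkward (and possibly unsafe) way of getting the root element: I assume it is always the last element yielded by iterparse. If anyone knows a better way, please let me know.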
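One possible alternative for getting the root, sketched here under assumptions rather than taken from the original answer: ask iterparse for 'start' events as well, and take the element from the very first event, which is the start of the root element. The helper name strip_records, the UNWANTED set, and the tiny sample document (including the DescriptorRecordSet and Name tags) are made up for illustration; only the tag names in UNWANTED come from the question.

```python
import io
from xml.etree import ElementTree as ET

# Tags to drop from each record (the list from the question).
UNWANTED = {'DateCreated', 'Year', 'Month', 'TreeNumber',
            'HistoryNote', 'PreviousIndexing'}


def strip_records(source, record_tag='DescriptorRecord', unwanted=UNWANTED):
    """Parse `source`, remove unwanted children of each record, return the root."""
    context = ET.iterparse(source, events=('start', 'end'))
    # With 'start' events enabled, the first event is the start of the root.
    _, root = next(context)
    for event, elem in context:
        # Only act on 'end' events, when the record's children are complete.
        if event == 'end' and elem.tag == record_tag:
            for child in list(elem):  # copy, because elem is mutated in the loop
                if child.tag in unwanted:
                    elem.remove(child)
    return root


# Hypothetical miniature document standing in for the real 300 MB file.
sample = (b'<DescriptorRecordSet>'
          b'<DescriptorRecord><Name>a</Name><Year>2015</Year>'
          b'<HistoryNote>x</HistoryNote></DescriptorRecord>'
          b'</DescriptorRecordSet>')

root = strip_records(io.BytesIO(sample))
result = ET.tostring(root, encoding='unicode')
print(result)
```

The returned root can then be saved as in the answer above, with Et.ElementTree(root).write(...). Note that this sketch, like the answer's code, still keeps the whole tree in memory; it only changes how the root is obtained.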