
NLTK Library: Datasets 3 - Categorized and Tagged Corpora


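1. Binary-Classification Corpora

These are mainly movie corpora for sentiment-related tasks (positive vs. negative, subjective vs. objective). Two corpora are covered:

1.1 movie_reviews: IMDb movie reviews

IMDb (Internet Movie Database) is a widely used movie database that provides ratings and user reviews for films, TV series, and more.

  • Size

2,000 reviews labeled positive or negative, 1,000 of each.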

['neg/cv000_29416.txt', 'neg/cv001_19502.txt', 'neg/cv002_17424.txt', 'neg/cv003_12683.txt', 'neg/cv004_12641.txt', 'neg/cv005_29357.txt', 'neg/cv006_17022.txt', 'neg/cv007_4992.txt', 'neg/cv008_29326.txt', 'neg/cv009_29417.txt', …]

[…, 'pos/cv992_11962.txt', 'pos/cv993_29737.txt', 'pos/cv994_12270.txt', 'pos/cv995_21821.txt', 'pos/cv996_11592.txt', 'pos/cv997_5046.txt', 'pos/cv998_14111.txt', 'pos/cv999_13106.txt']

  • Labels

A binary classification problem: ['neg', 'pos']

  • Review text

First negative review (neg/cv000_29416):

plot : 
two teen couples go to a church party , drink and then drive . 
they get into an accident . 
one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . 
what's the deal ? 
watch the movie and " sorta " find out . . . critique : 
a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . 
which is what makes this review an even harder one to write , 
since i generally applaud films which attempt to break the mold , 
mess with your head and such ( lost highway & memento ) , 
but there are good and bad ways of making all types of films , and these folks just didn't snag this one correctly . 
they seem to have taken this pretty neat concept , but executed it terribly . 
so what are the problems with the movie ? 
well , its main problem is that it's simply too jumbled . 
it starts off " normal " but then downshifts into this " fantasy " world in which you , as an audience member , have no idea what's going on . 
there are dreams , there are characters coming back from the dead , there are others who look like the dead , 
there are strange apparitions , there are disappearances , there are a looooot of chase scenes , 
there are tons of weird things that happen , and most of it is simply not explained . 
now i personally don't mind trying to unravel a film every now and then , but when all it does is give me the same clue over and over again , 
i get kind of fed up after a while , which is this film's biggest problem . 
it's obviously got this big secret to hide , but it seems to want to hide it completely until its final five minutes . 
and do they make things entertaining , thrilling or even engaging , in the meantime ? 
not really . 
the sad part is that the arrow and i both dig on flicks like this , so we actually figured most of it out by the half-way point , 
so all of the strangeness after that did start to make a little bit of sense , but it still didn't the make the film all that more entertaining . 
i guess the bottom line with movies like this is that you should always make sure that the audience is " into it " even before 
they are given the secret password to enter your world of understanding . 
i mean , showing melissa sagemiller running away from visions for about 20 minutes throughout the movie is just plain lazy ! ! 
okay , we get it . . . there 
are people chasing her and we don't know who they are . do we really need to see it over and over again ? 
how about giving us different scenes offering further insight into all of the strangeness going down in the movie ? 
apparently , the studio took this film away from its director and chopped it up themselves , and it shows . 
there might've been a pretty decent teen mind-fuck movie in here somewhere , but i guess " the suits " decided that turning it into a music video with little edge ,
would make more sense . 
the actors are pretty good for the most part , although wes bentley just seemed to be playing the exact same character that he did in american beauty , only in a new neighborhood . 
but my biggest kudos go out to sagemiller , who holds her own throughout the entire film , and actually has you feeling her character's unraveling . 
overall , the film doesn't stick because it doesn't entertain , it's confusing , it rarely excites and it feels pretty redundant for most of its runtime , 
despite a pretty cool ending and explanation to all of the craziness that came before it . 
oh , and by the way , this is not a horror or teen slasher flick . . . it's 
just packaged to look that way because someone is apparently assuming that the genre is still hot with the kids . 
it also wrapped production two years ago and has been sitting on the shelves ever since . 
whatever . . . skip 
it ! where's joblo coming from ? 
a nightmare of elm street 3 ( 7/10 ) - blair witch 2 ( 7/10 ) - the crow ( 9/10 ) - the crow : salvation ( 4/10 ) - lost highway ( 10/10 ) - memento ( 10/10 ) - the others ( 9/10 ) - stir of echoes ( 8/10 ) 

Second negative review (neg/cv001_19502):

damn that y2k bug . 
it's got a head start in this movie starring jamie lee curtis and another baldwin brother ( william this time ) 
in a story regarding a crew of a tugboat that comes across a deserted russian tech ship that has a strangeness to it when they kick the power back on . 
little do they know the power within . . . 
going for the gore and bringing on a few action sequences here and there , virus still feels very empty , like a movie going for all flash and no substance . 
we don't know why the crew was really out in the middle of nowhere , we don't know the origin of what took over the ship
( just that a big pink flashy thing hit the mir ) , and , of course , we don't know why donald sutherland is stumbling around drunkenly throughout . 
here , it's just " hey , let's chase these people around with some robots " . 
the acting is below average , even from the likes of curtis . 
you're more likely to get a kick out of her work in halloween h20 . 
sutherland is wasted and baldwin , well , he's acting like a baldwin , of course . 
the real star here are stan winston's robot design , some schnazzy cgi , and the occasional good gore shot , like picking into someone's brain . 
so , if robots and body parts really turn you on , here's your movie . 
otherwise , it's pretty much a sunken ship of a movie . 

The reviews vary widely in style and in length.
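The folder prefix of each file id is its label, so the class balance can be checked from the ids alone. A minimal sketch, assuming the `neg/...` and `pos/...` id layout shown above; `category_counts` and `load_fileids` are illustrative helper names, and the three ids in `sample` are copied from the listings above so the snippet runs without downloading the corpus:

```python
from collections import Counter


def category_counts(fileids):
    """Count files per label, reading the label from the folder prefix."""
    return Counter(fid.split("/")[0] for fid in fileids)


def load_fileids():
    """Return the real ids; needs nltk and nltk.download('movie_reviews')."""
    from nltk.corpus import movie_reviews  # imported lazily on purpose
    return movie_reviews.fileids()


# Three ids copied from the listing above, so no download is needed here.
sample = ["neg/cv000_29416.txt", "neg/cv001_19502.txt", "pos/cv999_13106.txt"]
print(category_counts(sample))
```

With the corpus installed, `category_counts(load_fileids())` should report 1,000 files under each label, matching the size stated above.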

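1.2 subjectivity: movie plot summaries and reviews

A dataset for subjectivity analysis. The corpus consists of 5,000 subjective sentences and 5,000 objective sentences, and is intended for sentiment analysis and subjectivity classification.

It comes from the paper "A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts" by Bo Pang and Lillian Lee (ACL 2004).

  • The 5,000 subjective sentences come from reviews, since those express the author's opinions, feelings, or judgments.

  • The 5,000 objective sentences come from movie plot summaries, which are usually factual descriptions without obvious personal feeling or judgment.

Unlike movie_reviews, the samples are not split into one file per review in folders; the sentences are stored together in plain text.

Each sentence is preprocessed so that words and punctuation are separated by spaces, and is parsed with WhitespaceTokenizer.

  • First obj sample

['the', 'movie', 'begins', 'in', 'the', 'past', 'where', 'a', 'young', 'boy', 'named', 'sam', 'attempts', 'to', 'save', 'celebi', 'from', 'a', 'hunter', '.']
['obj', 'subj']

This uses the sents() method; raw() would instead return all samples as a single string.

  • First subj sample

smart and alert , thirteen conversations about one thing is a small gem .

This uses the built-in string method join(), which joins the list elements with spaces into one string: " ".join(subj[0])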
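Because the sentences are already space-separated, tokenizing and re-joining are inverse operations. A small sketch in plain Python, on the assumption that `str.split()` behaves like whitespace tokenization for this single-spaced text:

```python
# A pre-tokenized sentence as stored in the corpus (copied from the sample above).
line = "smart and alert , thirteen conversations about one thing is a small gem ."

tokens = line.split()          # whitespace tokenization: split on runs of spaces
restored = " ".join(tokens)    # inverse step: join tokens with single spaces

print(tokens[:4])
print(restored == line)
```

The round trip only holds when tokens are separated by exactly one space; text with tabs or repeated spaces would come back normalized.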

  • Full code
from nltk.corpus import subjectivity

subj = subjectivity.sents(categories='subj')  # subjective sentences (fileids() and raw() work the same way)
obj = subjectivity.sents(categories='obj')    # objective sentences
categories = subjectivity.categories()        # returns ['obj', 'subj']

print(len(subj), " ".join(subj[0]))
print(len(obj), obj[0])
print(categories)

2. Multi-Class Corpora

The Reuters corpus (reuters), covered in Part 1, carries multiple news-topic labels:

[‘acq’, ‘alum’, ‘barley’, ‘bop’, ‘carcass’, ‘castor-oil’, ‘cocoa’, ‘coconut’, ‘coconut-oil’, ‘coffee’, ‘copper’, ‘copra-cake’, ‘corn’, ‘cotton’, ‘cotton-oil’, ‘cpi’, ‘cpu’, ‘crude’, ‘dfl’, ‘dlr’, ‘dmk’, ‘earn’, ‘fuel’, ‘gas’, ‘gnp’, ‘gold’, ‘grain’, ‘groundnut’, ‘groundnut-oil’, ‘heat’, ‘hog’, ‘housing’, ‘income’, ‘instal-debt’, ‘interest’, ‘ipi’, ‘iron-steel’, ‘jet’, ‘jobs’, ‘l-cattle’, ‘lead’, ‘lei’, ‘lin-oil’, ‘livestock’, ‘lumber’, ‘meal-feed’, ‘money-fx’, ‘money-supply’, ‘naphtha’, ‘nat-gas’, ‘nickel’, ‘nkr’, ‘nzdlr’, ‘oat’, ‘oilseed’, ‘orange’, ‘palladium’, ‘palm-oil’, ‘palmkernel’, ‘pet-chem’, ‘platinum’, ‘potato’, ‘propane’, ‘rand’, ‘rape-oil’, ‘rapeseed’, ‘reserves’, ‘retail’, ‘rice’, ‘rubber’, ‘rye’, ‘ship’, ‘silver’, ‘sorghum’, ‘soy-meal’, ‘soy-oil’, ‘soybean’, ‘strategic-metal’, ‘sugar’, ‘sun-meal’, ‘sun-oil’, ‘sunseed’, ‘tea’, ‘tin’, ‘trade’, ‘veg-oil’, ‘wheat’, ‘wpi’, ‘yen’, ‘zinc’]

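There are 90 topic labels, covering finance, commodities, trade, markets, currencies, and related areas.

Each news item can carry several labels.

The code that prints them:

from nltk.corpus import reuters
categories = reuters.categories()  # all category labels
print(categories)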
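Since one news item can carry several labels, it is often handy to invert the mapping and ask which documents fall under each label. A minimal sketch: `invert_labels` and `reuters_labels` are illustrative helper names, the document ids in `toy` are made up, and only the label names come from the list above. `reuters.categories(fileid)` is the NLTK call that returns the labels of one file.

```python
from collections import defaultdict


def invert_labels(doc_labels):
    """Turn {doc_id: [labels]} into {label: [doc_ids]} for multi-label data."""
    by_label = defaultdict(list)
    for doc_id, labels in doc_labels.items():
        for label in labels:
            by_label[label].append(doc_id)
    return dict(by_label)


def reuters_labels(fileids):
    """Look up real labels; needs nltk and nltk.download('reuters')."""
    from nltk.corpus import reuters  # imported lazily on purpose
    return {fid: reuters.categories(fid) for fid in fileids}


# Made-up documents (hypothetical ids), using label names from the list above.
toy = {"doc_a": ["grain", "wheat"], "doc_b": ["wheat"], "doc_c": ["gold"]}
print(invert_labels(toy))
```

With the corpus installed, `invert_labels(reuters_labels(reuters.fileids()[:100]))` would apply the same grouping to real documents.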

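Another multi-class corpus is product_reviews_1 (product reviews):

2.1 product_reviews_1: product overview

  • Apex_AD2600_Progressive_scan_DVD player.txt

    • Progressive-scan DVD player: supports 480p high-resolution output, suited to HDTV or HD-ready televisions.
    • Multi-format support: plays DVD, MP3 CD, WMA CD, and JPEG/Kodak picture CDs (for slideshows), with partial DVD-R support.
    • Full-screen fit function (AFF): adapts 16:9 widescreen video to a 4:3 TV screen.
    • Price: about USD 39.99-69.99 in 2003-2004 (after discounts), positioned as a budget model.
    • Review summary:
      • Positive: users praise its multi-format playback and value for money, e.g. "plays almost any disc you put in".
      • Negative: poor remote control (sluggish, not universal), some DVDs (such as Disney films) will not play, weak durability (some units failed within months).
      • Sentiment keywords: "picture quality", "cheap", "remote doesn't work"
      • Summary: well received for its low price and format support, criticized for reliability and the remote control.
  • Canon_G3.txt

    • Digital camera: 4-megapixel sensor, suited to early-2000s photography needs.
    • Lens: high-quality Canon lens with optical zoom (about 4x).
    • Features: manual controls and RAW shooting; a compact design aimed at advanced users, enthusiasts, and semi-professionals.
    • Review summary:
      • Positive: high image quality, flexible manual controls, versatile, e.g. "photos are sharp with vivid colors".
      • Negative: steep learning curve for manual settings; the body is somewhat bulkier than a point-and-shoot camera.
      • Sentiment keywords: "great pictures", "slow focus", "battery life is bad"
      • Summary: well received for its professional features and compact design; suited to users who want high-quality photography.
  • Creative_Labs_Nomad_Jukebox_Zen_Xtra_40GB.txt

    • Portable audio device (MP3 player), released in 2003 and aimed at serious music listeners.
    • Storage: 40 GB hard drive, holding about 10,000 songs.
    • Audio: supports MP3, WMA, and other formats, with excellent sound quality.
    • Features: replaceable battery, large display, fast USB 2.0 transfer.
    • Review summary:
      • Positive: capacity and sound quality are praised, e.g. "stores a huge number of songs with clear sound".
      • Negative: slow hard-drive start-up, complicated interface, occasional freezes, short battery life.
      • Sentiment keywords: "sound is awesome", "software sucks", "large capacity"
      • Summary: favored for its capacity and sound quality, criticized for complexity of use and reliability.
  • Nikon_coolpix_4300.txt

    • Digital camera: 4-megapixel resolution, suited to everyday photography.
    • Lens: Nikon optical zoom lens (about 3x optical zoom).
    • Features: automatic and manual modes, compact design, easy to carry.
    • Target users: home users and beginners.
    • Review summary:
      • Positive: easy to use, good image quality, portable, e.g. "the camera is small and the photos look good".
      • Negative: weak low-light results, fast battery drain, lacks advanced manual functions.
      • Sentiment keywords: "excellent pictures", "a bit heavy", "menus are confusing"
      • Summary: popular with home users for portability and ease of use, but only average in low light.
  • Nokia_6610.txt

    • Earlier-generation mobile phone (not a smartphone) with a color screen (128x128 pixels); a best-selling model with a mainstream design around 2003.
    • Features: SMS, MMS, FM radio, GPRS, Java games.
    • Design: classic candy-bar form, clean look, durable, internal antenna.
    • Review summary:
      • Positive: stable signal, long battery life, durable build, e.g. "the battery lasts for days".
      • Negative: small screen, basic feature set (e.g. no Bluetooth), mediocre key feel.
      • Sentiment keywords: "great signal", "buttons are too small", "classic Nokia build"
      • Summary: well received for durability and battery life; its features are fairly simple, which suits basic communication needs.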

2.2 product_reviews_1: product labels and reviews

Reviews for 5 products

2.2.1 Apex_AD2600_Progressive_scan_DVD player.txt

  • Number of review sentences: 740

  • Labels

[‘1220’, ‘1600’, ‘aff’, ‘amazon’, ‘apex’, ‘audio’, ‘audio output’, ‘auto fit’, ‘build quality’, ‘button’, ‘case’, ‘cd’, ‘cd audio disc’, ‘code’, ‘color’, ‘color signal’, ‘customer service’, ‘customer support’, ‘design’, ‘different file’, ‘direction’, ‘disc’, ‘disk’, ‘disney movie’, ‘display’, ‘divx rip’, ‘door’, ‘dvd’, ‘dvd disc’, ‘dvd media’, ‘dvd player’, ‘external display’, ‘feature’, ‘finish’, ‘format’, ‘forward’, ‘freeze’, ‘freezing’, ‘heat’, ‘jpeg’, ‘jpeg picture’, ‘jpeg slideshow’, ‘layer dvd’, ‘line support’, ‘loading’, ‘look’, ‘machine’, ‘manual’, ‘media’, ‘menu’, ‘motor’, ‘mp3’, ‘mp3 filename’, ‘mpeg’, ‘mpeg1’, ‘no disc’, ‘noise’, ‘off button’, ‘onscreen display’, ‘output’, ‘p button’, ‘panel’, ‘panel button layout’, ‘picture’, ‘picture clarity’, ‘picture quality’, ‘play’, ‘player’, ‘power supply’, ‘price’, ‘product’, ‘progressive scan’, ‘progressive scan player’, ‘quality’, ‘r’, ‘read’, ‘recognize’, ‘reliability’, ‘remote’, ‘remote button’, ‘remote control’, ‘remote layout’, ‘rewind’, ‘run’, ‘screen’, ‘screw tip’, ‘service’, ‘set up’, ‘shipping’, ‘silver plate’, ‘size’, ‘smell’, ‘sound’, ‘speed’, ‘support’, ‘svcd’, ‘sync’, ‘tech support’, ‘technical support’, ‘unit’, ‘universal remote control’, ‘usage’, ‘use’, ‘user interface’, ‘vbr mp3 cd’, ‘vcd’, ‘video’, ‘video format’, ‘video output’, ‘video quality’, ‘weight’, ‘windows media’, ‘work’, ‘zoom’, ‘zoom mode’]

  • Samples
1 repost from january 13 , 2004 with a better fit title .
2 does your apex dvd player only play dvd audio without video ?
3 or does it play audio and video but scrolling in black and white ?
4 before you try to return the player or waste hours calling apex tech support , or run the player over with your car , 
try these simple troubleshooting ideas first .
5 no picture :
...
734 however , i do n ' t know the dvd ' s performance on a heavy load of every - day viewing .
735 either way , can ' t go wrong with this price .
736 i am really impressed by this dvd player .
737 if it can fit in the drive bay , this dvd player will play it .
738 for instance , i made several back - ups of my dvd movies using dvd - r ( w ) and + r ( w ) and it plays the dvds .
739 no matter the format .
740 awesome !

2.2.2 Canon_G3.txt

  • Number of review sentences: 597

  • Labels

[‘4mp’, ‘4mp camera’, ‘4mp resolution’, ‘auto mode’, ‘auto setting’, ‘automode’, ‘battery’, ‘battery charging system’, ‘battery life’, ‘body’, ‘button’, ‘camera’, ‘canera’, ‘canon’, ‘canon g3’, ‘canon powershot g3’, ‘casing’, ‘color’, ‘compactflash’, ‘control’, ‘darn diopter adjustment dial’, ‘delay’, ‘depth’, ‘design’, ‘dial’, ‘digital camera’, ‘digital zoom’, ‘display’, ‘distortion’, ‘download’, ‘exposure control’, ‘external flash hot shoe’, ‘feature’, ‘feel’, ‘finish’, ‘flash’, ‘flash photo’, ‘focus’, ‘four megapixel’, ‘function’, ‘g3’, ‘grain’, ‘highlight’, ‘hot shoe flash’, ‘image’, ‘image quality’, ‘import’, ‘lag’, ‘lag time’, ‘lcd’, ‘learning’, ‘learning curve’, ‘lens’, ‘lens cap’, ‘lens cover’, ‘lense’, ‘lever’, ‘light auto correction’, ‘look’, ‘low light focus’, ‘macro’, ‘made’, ‘manual’, ‘manual function’, ‘manual mode’, ‘memory card’, ‘menu’, ‘metering option’, ‘night mode’, ‘noise’, ‘off button’, ‘optic’, ‘optical zoom’, ‘option’, ‘performance’, ‘photo’, ‘photo quality’, ‘picture’, ‘picture quality’, ‘price’, ‘print’, ‘product’, ‘quality’, ‘raw format’, ‘raw image’, ‘remote’, ‘service’, ‘shape’, ‘shoot’, ‘shot’, ‘size’, ‘software’, ‘speed’, ‘spot metering’, ‘stitch picture’, ‘strap’, ‘tiff format’, ‘unresponsiveness’, ‘use’, ‘viewfinder’, ‘weight’, ‘white balance’, ‘white offset’, ‘zoom’, ‘zooming lever’]

1 i recently purchased the canon powershot g3 and am extremely satisfied with the purchase .
2 the camera is very easy to use , in fact on a recent trip this past week i was asked to take a picture of a vacationing elderly group .
3 after i took their picture with their camera , they offered to take a picture of us .
4 i just told them , press halfway , wait for the box to turn green and press the rest of the way .
...
593 even with these shortcomings , i still think it is the best digital camera available under $ 1200 .
594 definetely a great camera .
595 proven canon built quality and lens .
596 feels solid in hand .
597 rather heavy for point and shoot but a great camera for semi pros .

2.2.3 Creative_Labs_Nomad_Jukebox_Zen_Xtra_40GB.txt

  • Number of review sentences: 1716

  • Labels

[‘0’, ‘accessing file’, ‘accessory’, ‘affordability’, ‘alarm’, ‘appearance’, ‘audio’, ‘backlight’, ‘balance’, ‘battery’, ‘battery life’, ‘battery top’, ‘bookmakr’, ‘bookmark’, ‘break’, ‘buck’, ‘build’, ‘button’, ‘capacity’, ‘case’, ‘cd burner’, ‘cd rip’, ‘change’, ‘chinese name’, ‘click buttons’, ‘clip’, ‘clock’, ‘color’, ‘construction’, ‘control’, ‘cover’, ‘creative’, ‘creative product’, ‘customer support’, ‘customer support website’, ‘deal’, ‘delete’, ‘design’, ‘display’, ‘durability’, ‘earbud’, ‘earphone’, ‘eax’, ‘eax mode’, ‘enviromental audio’, ‘equalizer’, ‘equilizer’, ‘equipment’, ‘explorer’, ‘face plate’, ‘feature’, ‘feel’, ‘file limit’, ‘file transfer’, ‘finding’, ‘firewire’, ‘firmware’, ‘flip switch’, ‘fly wheel’, ‘flywheel’, ‘fm’, ‘fm receiver’, ‘folder’, ‘folder structure’, ‘freeze’, ‘freeze up’, ‘front cover’, ‘game’, ‘hard drive’, ‘headphone’, ‘headphone jack’, ‘id3’, ‘id3 tag’, ‘installation’, ‘instruction’, ‘interface’, ‘itunes’, ‘jog dial’, ‘lcd’, ‘leather case’, ‘leather pouch’, ‘line out jack’, ‘load’, ‘lock up’, ‘look’, ‘looking’, ‘manage’, ‘manual’, ‘mediasource’, ‘memory’, ‘menu’, ‘menue’, ‘mp3 player’, ‘music’, ‘musicmatch software’, ‘name’, ‘napster’, ‘navigation’, ‘navigation wheel’, ‘navigational system’, ‘nomad’, ‘nomad explorer’, ‘notmad’, ‘notmad software’, ‘online help’, ‘online music service’, ‘operate’, ‘option’, ‘panel’, ‘pause’, ‘pc compatibility’, ‘play’, ‘play mode’, ‘play option’, ‘playback quality’, ‘player’, ‘player hardware’, ‘playlist’, ‘plug and play’, ‘power output’, ‘price’, ‘product’, ‘program’, ‘quality’, ‘recharger’, ‘recognition’, ‘recording’, ‘remote’, ‘remove’, ‘rename’, ‘replacement battery’, ‘rip’, ‘rip cd’, ‘screen’, ‘screen saver’, ‘scroll’, ‘scroll wheel’, ‘set up’, ‘setup’, ‘shuffle’, ‘shuttle’, ‘signal to noise ratio’, ‘size’, ‘software’, ‘song speed’, ‘sorting’, ‘sound’, ‘sound option’, ‘sound quality’, ‘sound setting’, ‘stop button’, ‘stoppage’, ‘storage’, ‘storage capacity’, ‘style’, ‘support’, ‘switch’, 
‘sync’, ‘tag’, ‘the unit’, ‘thing’, ‘things’, ‘this’, ‘this item’, ‘this thing’, ‘top’, ‘transfer’, ‘transfter’, ‘unit’, ‘up face’, ‘uploading’, ‘usb recharge’, ‘use’, ‘user interface’, ‘value’, ‘voice recording’, ‘volume’, ‘volume range’, ‘wake up’, ‘warranty’, ‘weight’, ‘wheel’, ‘wma file’, ‘work’, ‘xtra’, ‘zen’, ‘zen xtra’, ‘zx’]

  • Samples
1 this is an edited review , now that i have had time to use the device .
2 while , there are flaws with the machine , the xtra gets five stars because of its affordability .
3 it is the most bang - for - the - buck out there .
4 like it ' s predecessor , the quickly revised nx , this player boasts a decent size and weight , 
a relatively - intuitive navigational system that categorizes based on id3 tags ,  and excellent sound 
( widely known to be better than ipod - not surprising considering the number of years creative has been in the audio peripheral business ) .
5 the xtra improves upon the zen nx with a larger , now - blue backlit screen , which is infinitely better .
6 further , the xtra doubles the maximum filecount capacity to 16000 mp3 .
...
1712 in that model the hard drive just died one morning before my class .
1713 it ' s nothing major , just a bad hard drive , any hard drive mp3 player can have that problem .
1714 so rule of thumb , no matter what you end up buying , get the extended warranty !
1715 it always pays off .
1716 hope i ' ve been of some help .

2.2.4 Nikon_coolpix_4300.txt

  • Number of review sentences: 346

  • Labels

[‘4mp’, ‘8mb’, ‘8mb card’, ‘accessory’, ‘audio’, ‘auto focus’, ‘auto mode’, ‘auto setting’, ‘autofocus’, ‘battery’, ‘battery life’, ‘camera’, ‘closeup mode’, ‘construction’, ‘continuous shot mode’, ‘control’, ‘customer service’, ‘delay’, ‘design’, ‘digital zoom’, ‘download’, ‘ease of use’, ‘feature’, ‘firewire’, ‘focus assist light’, ‘function’, ‘image’, ‘image download’, ‘indoor image’, ‘indoor picture’, ‘indoor shot’, ‘lcd’, ‘learn’, ‘lens cap’, ‘lense cap’, ‘macro’, ‘macro mode’, ‘manual’, ‘manual mode’, ‘memory card’, ‘menu’, ‘menu dial knob’, ‘movie’, ‘movie mode’, ‘nikon’, ‘nikon 4300’, ‘nikon support’, ‘online service’, ‘optic’, ‘optical setting’, ‘optical zoom’, ‘photo’, ‘photo quality’, ‘picture’, ‘picture quality’, ‘price’, ‘print’, ‘print quality’, ‘quality’, ‘rechargable battery’, ‘redeye’, ‘scene mode’, ‘servicing’, ‘size’, ‘software’, ‘sunset feature’, ‘system error’, ‘touchup’, ‘transfer’, ‘txt file’, ‘up shooting’, ‘use’, ‘viewfinder’, ‘weight’, ‘zoomed image’]

  • Samples
1 this camera is perfect for an enthusiastic amateur photographer .
2 the pictures are razor - sharp , even in macro .
3 it is small enough to fit easily in a coat pocket or purse .
4 it is light enough to carry around all day without bother .
...
345 the same 4mp chip from the 4500 camera , plus a 3x zoom with the ability to expand upon that with extenders , 
great closeup mode , long lasting rechargable battery , etc etc .
346 in my opinion it ' s the best camera for the money if you ' re looking for something that ' s easy to use , 
small good for travel , and provides excellent , sharp images .

2.2.5 Nokia_6610.txt

  • Number of review sentences: 547

  • Labels

[‘application’, ‘background’, ‘backlight’, ‘battery’, ‘battery life’, ‘bluetooth’, ‘browsing’, ‘button’, ‘calendar’, ‘call’, ‘camera’, ‘color’, ‘color screen’, ‘command’, ‘construction’, ‘csr’, ‘customer rep’, ‘customer service’, ‘default ringtone’, ‘design’, ‘durability’, ‘ear’, ‘earpiece’, ‘ergonomics’, ‘feature’, ‘fm’, ‘fm radio’, ‘game’, ‘gprs’, ‘gsm’, ‘headphone jack’, ‘headset’, ‘headset jack’, ‘high speed internet’, ‘infrared’, ‘internet’, ‘key’, ‘key lock’, ‘keypad’, ‘layout’, ‘look’, ‘loud phone’, ‘memory’, ‘menu’, ‘menu option’, ‘menu options’, ‘message’, ‘mms’, ‘mobile’, ‘mobile reception’, ‘mobile service’, ‘network’, ‘nokia’, ‘operate’, ‘pc cable’, ‘pc suite’, ‘pc sync’, ‘phone’, ‘phone book’, ‘phone performance’, ‘picture’, ‘picture sharing’, ‘pim’, ‘plan’, ‘quality’, ‘radio’, ‘rate plan’, ‘reception’, ‘resolution’, ‘ring’, ‘ring tone’, ‘ringer’, ‘ringing tone’, ‘ringtone’, ‘screen’, ‘screensaver’, ‘service’, ‘signal’, ‘signal quality’, ‘size’, ‘software’, ‘sound’, ‘sound quality’, ‘sound volume’, ‘speaker’, ‘speaker phone’, ‘speakerphone’, ‘sprint’, ‘sprint customer service’, ‘sprint plan’, ‘sturdy’, ‘t customer service’, ‘tone’, ‘tune’, ‘use’, ‘user interface’, ‘vibrate setting’, ‘vibration’, ‘voice’, ‘voice dialing’, ‘voice quality’, ‘volume’, ‘volume control’, ‘volume key’, ‘wallpaper’, ‘warranty’, ‘web’, ‘weight’, ‘wireless telephone’, ‘work’, ‘zone’]

  • Samples
1 i am a business user who heavily depend on mobile service .
2 there is much which has been said in other reviews about the features of this phone , 
it is a great phone , mine worked without any problems right out of the box .
3 just double check with customer service to ensure the number provided by amazon is for the city / exchange you wanted .
4 after several years of torture in the hands of at & t customer service i am delighted to drop them , 
and look forward to august 2004 when i will convert our other 3 family - phones from at & t to t - mobile !
...
544 it is crystal clear .
545 this is one of the nicest phones nokia has made .
546 i do recommend getting the data kit for those geeks .
547 there are a lot of cool websites with games and midi ringtones to download for free .
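The label lists above appear to be the product features annotated in the corpus. A sketch under that assumption: NLTK's reader returns them from `product_reviews_1.features(fileid)` as (feature, score) pairs such as ('picture', '+2'), where the sign gives the polarity. `mean_feature_scores` and `canon_pairs` are illustrative helper names, and the pairs in `toy` are made up (only the feature names come from the lists above).

```python
from collections import defaultdict


def mean_feature_scores(pairs):
    """Average the signed scores per feature from (feature, score) pairs."""
    totals = defaultdict(list)
    for feature, score in pairs:
        totals[feature].append(int(score))  # int() accepts '+2' and '-1'
    return {f: sum(v) / len(v) for f, v in totals.items()}


def canon_pairs():
    """Real pairs; needs nltk and nltk.download('product_reviews_1')."""
    from nltk.corpus import product_reviews_1  # imported lazily on purpose
    return product_reviews_1.features("Canon_G3.txt")


# Made-up pairs for illustration only; the feature names appear in the lists above.
toy = [("picture", "+2"), ("picture", "+3"), ("battery life", "-1")]
print(mean_feature_scores(toy))
```

With the corpus installed, `mean_feature_scores(canon_pairs())` gives a rough per-feature sentiment summary for the Canon G3 reviews.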

3. Corpora for Syntactic Analysis

Corpora with part-of-speech (POS) annotation, suitable for training and testing POS taggers.

3.1 Brown

The Brown Corpus, shown here with its news text, carries this kind of annotation. A simplified excerpt of its tagset:

Tag      Meaning
AT       Article
NN       Noun
JJ       Adjective
VBD      Verb, past tense
NP-TL    Proper noun in a title
NN-TL    Noun in a title
  • Test code
from nltk.corpus import brown
tagged_sent = brown.tagged_sents()[0]  # first tagged sentence (Brown tagset)
print(tagged_sent[:10])
  • Tagged output

[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN')]
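Tags such as NP-TL and NN-TL are a base tag plus a -TL suffix for words inside a title. A minimal sketch of collapsing them to the base tag; `base_tag` is an illustrative helper, and it is a simplification that only handles hyphenated suffixes (other Brown conventions, such as the FW- prefix or combined tags joined with "+", are not treated):

```python
def base_tag(tag):
    """Strip Brown suffixes such as -TL (title) to get the base tag."""
    if tag.startswith("-"):      # leave dash-like punctuation tags untouched
        return tag
    return tag.split("-")[0]


# The tagged tokens printed above.
tagged = [("The", "AT"), ("Fulton", "NP-TL"), ("County", "NN-TL"),
          ("Grand", "JJ-TL"), ("Jury", "NN-TL"), ("said", "VBD"),
          ("Friday", "NR"), ("an", "AT"), ("investigation", "NN"), ("of", "IN")]

print([(word, base_tag(tag)) for word, tag in tagged])
```

For a coarser mapping, NLTK's corpus readers also accept `tagset='universal'` in `tagged_sents()`, provided the universal_tagset resource is installed.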

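3.2 treebank: Penn Treebank (syntactic parsing)

The Treebank corpus is based mainly on Wall Street Journal (WSJ) articles. Files are named wsj_XXXX.mrg; the sample shipped with NLTK has 199 files (wsj_0001.mrg to wsj_0199.mrg).

  • Listing the file names
from nltk.corpus import treebank
file_ids = treebank.fileids()
print(file_ids)

Output:

['wsj_0001.mrg', 'wsj_0002.mrg', 'wsj_0003.mrg', 'wsj_0004.mrg', 'wsj_0005.mrg', …, 'wsj_0199.mrg']

  • The Treebank corpus contains:

  • POS-tagged text: words with their POS tags.

  • Parsed sentences: the syntactic structure of each sentence, represented as a tree.

  • Raw text: sentences without annotation.

Each file holds the annotation of several sentences. The full Penn Treebank WSJ section has about 1 million words, while the NLTK sample has roughly 100,000 tokens. Parse trees can be nested, so processing them requires familiarity with recursion or tree traversal.

3.2.1 POS tagging

NLTK uses Penn Treebank POS tags by default:

In Brown, articles are tagged AT; in the Penn Treebank they fall under the single tag DT (Determiner).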

Tag    Meaning                                     Examples
CC     Coordinating conjunction                    and, but, or, yet
CD     Cardinal number                             one, two, 1999
DT     Determiner                                  the, a, an
EX     Existential there                           there (There is ...)
FW     Foreign word                                c'est, etc., esprit
IN     Preposition or subordinating conjunction    in, of, like, although
JJ     Adjective                                   big, beautiful, green
JJR    Adjective, comparative                      bigger, smaller
JJS    Adjective, superlative                      biggest, smallest
LS     List item marker                            1), a), B.
MD     Modal                                       can, will, should
NN     Noun, singular                              dog, year
NNS    Noun, plural                                dogs, years
NNP    Proper noun, singular                       John, London
NNPS   Proper noun, plural                         Smiths, Americans
PDT    Predeterminer                               all, both
POS    Possessive ending                           's, '
PRP    Personal pronoun                            I, you, he, she, it
PRP$   Possessive pronoun                          my, your, his, her
RB     Adverb                                      quickly, very, well
RBR    Adverb, comparative                         better, faster
RBS    Adverb, superlative                         best, fastest
RP     Particle                                    up, off, out
SYM    Symbol                                      $, %, +
TO     to                                          to
UH     Interjection                                oh, wow, oops
VB     Verb, base form                             run, eat, be
VBD    Verb, past tense                            ran, ate, was
VBG    Verb, gerund/present participle             running, being
VBN    Verb, past participle                       eaten, been
VBP    Verb, non-3rd person singular present       run, eat (subjects other than he/she/it)
VBZ    Verb, 3rd person singular present           runs, eats, is
WDT    Wh-determiner (precedes a noun)             which, that
WP     Wh-pronoun (subject or object on its own)   who, what
WP$    Possessive wh-pronoun (modifies a noun)     whose
WRB    Wh-adverb (place, time, reason, manner)     where, when, why

Punctuation is mostly tagged with the symbol itself:

Symbol   POS tag        Description
.        .              Period
,        ,              Comma
:        :              Colon, semicolon, dash
''       ''             Closing quotation mark
( )      -LRB- -RRB-    Left and right parentheses
...      :              (Rare) ellipsis; may be split into different tokens
  • Test code
tagged_words = treebank.tagged_words()  # all words with their POS tags
print(tagged_words[:10])                # first 10 tagged words
tagged_words_file = treebank.tagged_words(fileids='wsj_0002.mrg')  # tagged words of one file
print(tagged_words_file[:10])

[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT')]

[('Rudolph', 'NNP'), ('Agnew', 'NNP'), (',', ','), ('55', 'CD'), ('years', 'NNS'), ('old', 'JJ'), ('and', 'CC'), ('former', 'JJ'), ('chairman', 'NN'), ('of', 'IN')]

3.2.2 Parse trees

Parse trees are the core content of Treebank and are stored as nltk.tree.Tree objects.

# all parse trees
parsed_sents = treebank.parsed_sents()

# print the first parse tree
print(parsed_sents[0])

# or draw it (Tree.draw() opens a Tkinter window)
parsed_sents[0].draw()

# parse trees of a specific file
parsed_sents_file = treebank.parsed_sents(fileids='wsj_0001.mrg')
print(parsed_sents_file[0])

# extract the plain sentence (word sequence)
tree = parsed_sents[0]
words = tree.leaves()
sentence = ' '.join(words)
print(sentence)
  • The sentence without annotation:

'Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .'

  • The tree in the graphical window:

[Figure: parse tree of the first sentence as drawn by draw()]

  • The tree printed on the command line:
(S
  (NP-SBJ
    (NP (NNP Pierre) (NNP Vinken))
    (, ,)
    (ADJP (NP (CD 61) (NNS years)) (JJ old))
    (, ,))
  (VP
    (MD will)
    (VP
      (VB join)
      (NP (DT the) (NN board))
      (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
      (NP-TMP (NNP Nov.) (CD 29))))
  (. .))
  • Structure breakdown
    1. (S …): the whole sentence (Sentence)
    2. (NP-SBJ …): the subject noun phrase
      2.1 (NNP Pierre) (NNP Vinken): a personal name, proper nouns
      2.2 The subject contains a parenthetical; the commas mark its boundaries
      2.3 ADJP: the adjective phrase "61 years old"
      2.4 (CD 61) (NNS years): numeral + noun, forming a noun phrase
      2.5 (JJ old): the adjective "old"
    3. (VP …): the verb phrase (predicate)
      3.1 (MD will): the modal verb "will"
      3.2 (VB join): the base-form verb "join"
      3.3 (NP the board): the direct object
      3.4 (PP-CLR as a nonexecutive director): a "closely related" prepositional phrase giving the role
      3.5 (NP-TMP Nov. 29): a temporal noun phrase, "Nov. 29"
    4. (. .): the period
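To make the recursion behind this structure concrete, here is a small sketch that parses the bracketed notation into nested lists and walks it to collect the words. It is an illustration in plain Python, not NLTK's implementation (NLTK's `Tree` already provides `leaves()`); `parse_brackets` and `leaves` are illustrative names, and the input is a shortened version of the tree above.

```python
import re


def parse_brackets(text):
    """Parse a bracketed tree into nested lists: [label, child, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token != "(":
            return token                 # a leaf word
        node = [tokens[pos]]             # the label follows the open bracket
        pos += 1
        while tokens[pos] != ")":
            node.append(read())          # recurse into each child
        pos += 1                         # consume the closing bracket
        return node

    return read()


def leaves(node):
    """Collect the words (leaves) left to right by recursive traversal."""
    if isinstance(node, str):
        return [node]
    return [word for child in node[1:] for word in leaves(child)]


tree = parse_brackets(
    "(S (NP-SBJ (NNP Pierre) (NNP Vinken)) "
    "(VP (MD will) (VP (VB join) (NP (DT the) (NN board)))) (. .))"
)
print(leaves(tree))
```

Each nested list mirrors one bracket pair, so any other traversal (depth, label counts, subtree search) follows the same recursive pattern as `leaves`.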

3.2.3 Raw text

  • Getting the sentences

sents = treebank.sents()  # all sentences

sents_file = treebank.sents(fileids='wsj_0001.mrg')  # sentences of a specific file

Output

['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']

  • Counting POS tag frequencies over all sentences
from collections import Counter

# frequency of every POS tag
tags = [tag for word, tag in treebank.tagged_words()]
tag_freq = Counter(tags)
print(tag_freq.most_common(10))  # the 10 most common tags

Output

[('NN', 13166), ('IN', 9857), ('NNP', 9410), ('DT', 8165), ('-NONE-', 6592),
('NNS', 6047), ('JJ', 5834), (',', 4886), ('.', 3874), ('CD', 3546)]

Singular nouns (NN) are the most frequent tag, followed by prepositions (IN).
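The -NONE- entry in the output above is not a real word class: in the Penn Treebank it marks empty elements such as traces. A minimal sketch of counting tags while skipping it; `tag_frequencies` is an illustrative helper, and the sample reuses the first ten tagged words printed earlier plus one hand-added trace token.

```python
from collections import Counter


def tag_frequencies(tagged_words, skip=("-NONE-",)):
    """Count POS tags, ignoring the tags listed in `skip`."""
    return Counter(tag for _, tag in tagged_words if tag not in skip)


# First ten tagged words of wsj_0001.mrg as printed earlier, plus one
# trace token added by hand to show the filter ('*-1' is illustrative).
sample = [("Pierre", "NNP"), ("Vinken", "NNP"), (",", ","), ("61", "CD"),
          ("years", "NNS"), ("old", "JJ"), (",", ","), ("will", "MD"),
          ("join", "VB"), ("the", "DT"), ("*-1", "-NONE-")]

print(tag_frequencies(sample).most_common(2))
```

On the real corpus, `tag_frequencies(treebank.tagged_words())` would give the same ranking as above with the -NONE- row removed.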

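3.3 CoNLL2000

This dataset comes from the CoNLL-2000 shared task (CoNLL: Computational Natural Language Learning). It targets chunking, i.e. shallow parsing, and is used to build and evaluate shallow chunkers, whether rule-based or machine-learned classifiers (e.g. CRF, MaxEnt, BiLSTM-CRF).

The data was converted from the Wall Street Journal (WSJ) portion of the Penn Treebank.

  • Annotated with chunk tags (syntactic blocks)

  • Each sentence is an nltk.Tree object

Chunk tags mark the phrase structure of a sentence. Each child node is a subtree representing a chunk (such as NP or VP). Compared with treebank, CoNLL-2000 does not annotate full clause structure; it only marks flat phrase chunks.

As a result, words such as subordinating conjunctions tagged IN (if, when, that), coordinating conjunctions tagged CC (and, but), and adverbs tagged RB may fall outside every NP, VP, or PP chunk and then appear on their own at the top level.

3.3.1 Triples

Each triple has three parts: the word, the POS tag, and the chunk tag.

  • POS tags (Part-of-Speech Tags)
    • Definition: a POS tag gives the grammatical category of a word, such as noun, verb, or adjective.
    • Tagset: in the CoNLL-2000 corpus, POS tags are based on the Penn Treebank tagset, for example:
      • NN: noun (singular)
      • IN: preposition
      • DT: determiner
      • VBZ: verb (3rd person singular)
      • RB: adverb
      • VBN: verb (past participle)
      • TO: to (as preposition or infinitive marker)
      • VB: verb (base form)

Role: a POS tag describes the grammatical category of a word, independent of the sentence's phrase structure.

  • Chunk tags

    • Definition: a chunk tag gives the phrase chunk a word belongs to, such as a noun phrase (NP), verb phrase (VP), or prepositional phrase (PP). Chunk boundaries are marked in IOB format.
    • IOB format:
      • B-XXX: the beginning of a chunk, where XXX is the chunk type (NP, VP, PP, ...).
      • I-XXX: inside a chunk, i.e. a word that continues it.
      • O: the word belongs to no chunk.
  • Common chunk types

    • NP: noun phrase, e.g. "the pound".
    • VP: verb phrase, e.g. "is widely expected to take".
    • PP: prepositional phrase; in this flat scheme the chunk holds only the preposition, e.g. "in".

Example: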

[('Confidence', 'NN', 'B-NP'), ('in', 'IN', 'B-PP'), ('the', 'DT', 'B-NP'),
('pound', 'NN', 'I-NP'), ('is', 'VBZ', 'B-VP'), ('widely', 'RB', 'I-VP'), 
('expected', 'VBN', 'I-VP'), ('to', 'TO', 'I-VP'), ('take', 'VB', 'I-VP'), 
('another', 'DT', 'B-NP')]

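Example analysis:

  • ('Confidence', 'NN', 'B-NP'): B-NP marks "Confidence" as the start of a noun phrase.
  • ('in', 'IN', 'B-PP'): B-PP marks "in" as the start of a prepositional phrase.
  • ('pound', 'NN', 'I-NP'): I-NP marks "pound" as inside a noun phrase (the NP started by "the").
  • ('is', 'VBZ', 'B-VP'): B-VP marks "is" as the start of a verb phrase.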
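The IOB rules can be applied mechanically to turn the triples into chunks. A minimal sketch using the example triples above; `iob_to_chunks` is an illustrative helper, not an NLTK function, and it simply drops words tagged O:

```python
def iob_to_chunks(triples):
    """Group (word, pos, iob) triples into (chunk_type, [words]) pairs."""
    chunks = []
    current_type, current_words = None, []
    for word, _pos, iob in triples:
        prefix, _, chunk_type = iob.partition("-")
        # An I- tag only continues a chunk of the same type that is open.
        continues = prefix == "I" and chunk_type == current_type
        if not continues:
            if current_words:
                chunks.append((current_type, current_words))
            current_type, current_words = None, []
            if prefix in ("B", "I"):        # a stray I- is treated like B-
                current_type, current_words = chunk_type, [word]
        else:
            current_words.append(word)
    if current_words:
        chunks.append((current_type, current_words))
    return chunks


triples = [("Confidence", "NN", "B-NP"), ("in", "IN", "B-PP"), ("the", "DT", "B-NP"),
           ("pound", "NN", "I-NP"), ("is", "VBZ", "B-VP"), ("widely", "RB", "I-VP"),
           ("expected", "VBN", "I-VP"), ("to", "TO", "I-VP"), ("take", "VB", "I-VP"),
           ("another", "DT", "B-NP")]

for chunk_type, words in iob_to_chunks(triples):
    print(chunk_type, " ".join(words))
```

A B- tag always opens a new chunk, which is what separates two adjacent chunks of the same type.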

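3.3.2 Shallow structure

There is no deeper nesting such as clause structure here, though the result still resembles a parse tree. The test sentence is: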

Sentence 1: 
['Confidence', 'in', 'the', 'pound', 'is', 'widely', 'expected', 
'to', 'take', 'another', 'sharp', 'dive', 'if', 'trade', 'figures', 
'for', 'September', ',', 'due', 'for', 'release', 'tomorrow', ',', 
'fail', 'to', 'show', 'a', 'substantial', 'improvement', 'from', 'July', 'and', 'August', "'s", 'near-record', 'deficits', '.']

The shallow tree is:

(S
  (NP Confidence/NN)
  (PP in/IN)
  (NP the/DT pound/NN)
  (VP is/VBZ widely/RB expected/VBN to/TO take/VB)
  (NP another/DT sharp/JJ dive/NN)
  if/IN
  (NP trade/NN figures/NNS)
  (PP for/IN)
  (NP September/NNP)
  ,/,
  due/JJ
  (PP for/IN)
  (NP release/NN)
  (NP tomorrow/NN)
  ,/,
  (VP fail/VB to/TO show/VB)
  (NP a/DT substantial/JJ improvement/NN)
  (PP from/IN)
  (NP July/NNP and/CC August/NNP)
  (NP 's/POS near-record/JJ deficits/NNS)
  ./.)

Another sentence:

(S
  (NP Rockwell/NNP International/NNP Corp./NNP)
  (NP 's/POS Tulsa/NNP unit/NN)
  (VP said/VBD)
  (NP it/PRP)
  (VP signed/VBD)
  (NP a/DT tentative/JJ agreement/NN)
  (VP extending/VBG)
  (NP its/PRP$ contract/NN)
  (PP with/IN)
  (NP Boeing/NNP Co./NNP)
  (VP to/TO provide/VB)
  (NP structural/JJ parts/NNS)
  (PP for/IN)
  (NP Boeing/NNP)
  (NP 's/POS 747/CD jetliners/NNS)
  ./.)

Format:

(Chunk Word/POS)

  • Code
from nltk.corpus import conll2000

# load the data
train_data = conll2000.chunked_sents('train.txt')
test_data = conll2000.chunked_sents('test.txt')
print(train_data[0])

3.4 Tag frequencies in the three corpora

Top 5 tags

  • Brown

[('NN', 152470), ('IN', 120557), ('AT', 97959), ('JJ', 64028), ('.', 60638)]

  • Treebank

[('NN', 13166), ('IN', 9857), ('NNP', 9410), ('DT', 8165), ('-NONE-', 6592)]

  • CoNLL-2000

[('NN', 36789), ('IN', 27835), ('NNP', 24690), ('DT', 22355), ('NNS', 16653)]
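A quick way to read these lists is to compare which tags the corpora share. The sketch below uses only the counts printed above; it suggests that the two Penn-Treebank-tagged corpora overlap on most of their top tags, while Brown's own tagset shows up as AT where the others have DT.

```python
# Top-5 tag counts copied from the three lists above.
top5 = {
    "brown": [("NN", 152470), ("IN", 120557), ("AT", 97959), ("JJ", 64028), (".", 60638)],
    "treebank": [("NN", 13166), ("IN", 9857), ("NNP", 9410), ("DT", 8165), ("-NONE-", 6592)],
    "conll2000": [("NN", 36789), ("IN", 27835), ("NNP", 24690), ("DT", 22355), ("NNS", 16653)],
}

tag_sets = {name: {tag for tag, _ in pairs} for name, pairs in top5.items()}

shared_by_all = set.intersection(*tag_sets.values())      # in every top-5 list
shared_ptb = tag_sets["treebank"] & tag_sets["conll2000"]  # the two PTB-style corpora

print(sorted(shared_by_all))
print(sorted(shared_ptb))
```

Only top-5 lists are compared here, so the overlap says nothing about tags further down the ranking.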

4. Tutorial code

All code used in this tutorial is in this folder:

  • https://github.com/disanda/d_code/tree/master/4.nltk


目录 一、引言 1.1 研究背景与意义 1.2 研究目的与创新点 1.3 研究方法与数据来源 二、胸椎管狭窄症概述 2.1 疾病定义与分类 2.2 病因与发病机制 2.3 流行病学特征 三、大模型技术原理与应用现状 3.1 大模型基本原理 3.2 在医疗领域的应用案例 3.3 用于胸椎管狭窄…...

Oracle OCP认证考试考点详解083系列15

题记: 本系列主要讲解Oracle OCP认证考试考点(题目),适用于19C/21C,跟着学OCP考试必过。 71. 第71题: 题目 解析及答案: 关于在 Oracle 18c 及更高版本中基于 Oracle 黄金镜像的安装,以下哪…...

【老飞飞源码】新版高清飞飞源码+数据库+客户端+服务器端完整文件打包

【老飞飞源码】新版高清飞飞源码数据库客户端服务器端完整文件打包下载 编译环境 vs2022 搭建环境 sql2022 测试运行环境 windows 11 本地测试生成搭建都成功 功能包含: pvp排行榜 宠物特效 箱子预览系统 vip系统 宝箱系统 内挂系统 离线摆摊系统 特效帽子系…...

Maven 动态插件配置:Profile的灵活集成实践

🧑 博主简介:CSDN博客专家,历代文学网(PC端可以访问:https://literature.sinhy.com/#/?__c1000,移动端可微信小程序搜索“历代文学”)总架构师,15年工作经验,精通Java编…...

Python爬虫如何应对网站的反爬加密策略?

在当今的互联网环境中,网络爬虫已经成为数据采集的重要工具之一。然而,随着网站安全意识的不断提高,反爬虫技术也越来越复杂,尤其是数据加密策略的广泛应用,给爬虫开发者带来了巨大的挑战。本文将详细介绍Python爬虫如…...

STM32H743输出50%的占空比波形

使用cubeMX进行配置如下: 时钟配置如下: 具体代码如下: /* USER CODE BEGIN Header */ /********************************************************************************* file : main.c* brief : Main program b…...

ios remote debut proxy 怎么开启手机端调试和inspect

手机开启远程调试教程(适用于 Chrome / Safari) 前端移动端调试指南|适用 iPhone 和 Android|WebDebugX 出品 本教程将详细介绍如何在 iPhone 和 Android 手机上开启网页检查器,配合 WebDebugX 实现远程调试。教程包含…...

GraspVLA:基于Billion-级合成动作数据预训练的抓取基础模型

25年5月来自银河通用(Galbot)、北大、港大和 BAAI 的论文“GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data”。 具身基础模型因其零样本泛化能力、可扩展性以及通过少量后训练即可适应新任务的优势&#x…...

BGP联邦实验

一.需求 1.AS1存在两个环回,一个地址为192.168.1.0/24,该地址不能再任何协议中宣告 AS3存在两个环回,一个地址为192.168.2.0/24,该地址不能再任何协议中宣告 AS1还有一个环回地址为10.1.1.0/24,AS3另一个环回地址是…...

自动化测试基础知识详解

🍅 点击文末小卡片,免费获取软件测试全套资料,资料在手,涨薪更快 自动化测试是指利用自动化工具和脚本,模拟人工操作进行软件测试的过程。它在软件开发中扮演着非常重要的角色,可以提高测试效率、降低成本…...

Java后端快速生成验证码

Hutool是一个小而全的Java工具类库,它提供了很多实用的工具类,包括但不限于日期处理、加密解密、文件操作、反射操作、HTTP客户端等。 核心工具类:CaptchaUtil,CaptchaUtil 是 Hutool 提供的一个工具类,用于创建各种类…...

【愚公系列】《Manus极简入门》036-物联网系统架构师:“万物互联师”

🌟【技术大咖愚公搬代码:全栈专家的成长之路,你关注的宝藏博主在这里!】🌟 📣开发者圈持续输出高质量干货的"愚公精神"践行者——全网百万开发者都在追更的顶级技术博主! &#x1f…...

主流高防服务器技术对比与AI防御方案实战

1. 高防服务器核心能力对比 当前市场主流高防服务商(如阿里云、腾讯云、华为云)的核心防御能力集中在流量清洗与静态规则防护,但面临以下挑战: 静态防御瓶颈:传统方案依赖预定义规则,对新型攻击&#xff…...

带格式的可配置文案展示

方案一(格式包含颜色换行等) 服务端:配置后接口输出带标签的字符串,但是尖括号不能被转义前端:v-html接受字符串(vue项目),原生用innerHTML赋值 方案二(格式针对只存在…...

湖南大学3D场景问答最新综述!3D-SQA:3D场景问答助力具身智能场景理解

作者: Zechuan Li, Hongshan Yu, Yihao Ding, Yan Li, Yong He, Naveed Akhtar 单位:湖南大学,墨尔本大学,悉尼大学,安徽大学 论文标题:Embodied Intelligence for 3D Understanding: A Survey on 3D Sce…...

【PyTorch】深度学习实践——第二章:线性模型

参考:刘二老师的《PyTorch深度学习实践》完结合集 本章实现了一个简单的线性回归模型,用于学习输入x和输出y之间的线性关系(yw*x)。 一、代码细节 1.数据准备 x_data [1.0, 2.0, 3.0] y_data [2.0, 4.0, 6.0]定义了训练数据,x和y之间显然…...

【Python 中文编码】

在 Python 中处理中文编码问题时,需重点关注文件编码声明、字符串编码转换及环境配置。以下是分步指南和最佳实践: 一、Python 3 的默认编码行为 Python 3.x:默认使用 UTF-8 编码(与 Python 2.x 的 ASCII 默认编码不同&#xff0…...

Excel宏和VBA

Excel宏和VBA(Visual Basic for Applications)是自动化Excel操作的强大工具,可帮助用户批量处理数据、自定义功能、提升效率。以下是详细使用方法及示例: --- ### **一、基础操作** #### 1. **录制宏** - **步骤**&#xff1…...

1688 API 接口使用限制

在使用 1688 API 接口时,需要注意以下几方面的限制和注意事项,以确保合规使用并避免不必要的问题。 一、调用频率限制 1688 平台对 API 接口的调用频率通常有限制,以防止滥用和对服务器造成过大压力。具体限制如下: 免费版&…...

5. 动画/过渡模块 - 交互式仪表盘

5. 动画/过渡模块 - 交互式仪表盘 案例&#xff1a;数据分析仪表盘 <!DOCTYPE html> <html><head><meta charset"utf-8"><title></title></head><style type"text/css">.dashboard {font-family: Arial…...

数据擦除标准:1-Pass vs. 3-Pass vs. 7-Pass有什么区别,哪个更好?

虽然像美国国防部(DoD)5220.22-M这样的旧标准提倡多次覆盖,但像NIST 800-88和新兴的IEEE 2883标准这样的新指南已经改变了对数据擦除效果的看法。在这篇博客中,我们解释了不同的擦除方法,并分析了旧标准在新时代是否仍然相关。 理解数据擦除方法 数据擦除包括用0、1或随…...

MySQL推荐书单:从入门到精通

给大家介绍一些 MySQL 从入门到精通的经典书单&#xff0c;可以基于不同学习阶段的需求进行选择。 入门 MySQL必知必会 这本书继承了《SQL必知必会》的优点&#xff0c;专门针对 MySQL 用户&#xff0c;没有过多阐述数据库基础理论&#xff0c;而是紧贴实战&#xff0c;直接从…...

Rodrigues旋转公式-绕任意轴旋转

Rodrigues旋转公式 给定旋转轴单位向量 k ( k x , k y , k z ) \mathbf{k}(k_x,k_y,k_z) k(kx​,ky​,kz​)和旋转角度 θ \theta θ&#xff0c;旋转矩阵 R R R可以表示为&#xff1a; R I sin ⁡ θ K ( 1 − cos ⁡ θ ) K 2 RI\sin \theta K(1-\cos \theta)K^2 RIsin…...

【大模型面试每日一题】Day 17:解释MoE(Mixture of Experts)架构如何实现模型稀疏性,并分析其训练难点

【大模型面试每日一题】Day 17&#xff1a;解释MoE&#xff08;Mixture of Experts&#xff09;架构如何实现模型稀疏性&#xff0c;并分析其训练难点 &#x1f4cc; 题目重现 &#x1f31f;&#x1f31f; 面试官:解释MoE&#xff08;Mixture of Experts&#xff09;架构如何…...

Datawhale 5月coze-ai-assistant 笔记1

课程地址&#xff1a; coze-ai-assistant-课程摘要 | Datawhalehttps://www.datawhale.cn/learn/summary/105 动手实践 链接&#xff1a;https://www.coze.cn/home 作业&#xff1a;智能体链接地址扣子扣子是新一代 AI 大模型智能体开发平台。整合了插件、长短期记忆、工作…...

2025.5.13总结

想要成为自己想要成为的那个人&#xff0c;并不是一件容易的事情。在我报口才课的时候&#xff0c;老师一针见血的指出了我的不足。因为不敢&#xff0c;所以不做&#xff0c;因为不去做&#xff0c;所以不会&#xff0c;而正因为不会&#xff0c;也导致了你不敢。当我听到这个…...

spring中的@Async注解详解

一、核心功能与作用 Async 是Spring框架提供的异步方法执行注解&#xff0c;用于将方法标记为异步任务&#xff0c;使其在独立线程中执行&#xff0c;从而提升应用的响应速度和吞吐量。其主要作用包括&#xff1a; 非阻塞调用&#xff1a;主线程调用被标记方法后立即返回&…...

计算机视觉----时域频域在图像中的意义、傅里叶变换在图像中的应用、卷积核的频域解释

1、时域&#xff08;时间域&#xff09;——自变量是时间,即横轴是时间,纵轴是信号的变化。其动态信号x&#xff08;t&#xff09;是描述信号在不同时刻取值的函数。 2、频域&#xff08;频率域&#xff09;——自变量是频率,即横轴是频率,纵轴是该频率信号的幅度,也就是通常说…...

分布式链路跟踪

目录 链路追踪简介 基本概念 基于代理&#xff08;Agent&#xff09;的链路跟踪 基于 SDK 的链路跟踪 基于日志的链路跟踪 SkyWalking Sleuth ZipKin 链路追踪简介 分布式链路追踪是一种监控和分析分布式系统中请求流动的方法。它能够记录和分析一个请求在系统中经历的每…...

从数据中台到数据飞轮:实现数据驱动的升级之路

从数据中台到数据飞轮&#xff1a;实现数据驱动的升级之路 随着数字化转型的推进&#xff0c;数据已经成为企业最重要的资产之一&#xff0c;企业普遍搭建了数据中台&#xff0c;用于整合、管理和共享数据&#xff1b;然而&#xff0c;近年来&#xff0c;数据中台的风潮逐渐减退…...

深入解析Java序列化:从使用到原理

在此之前&#xff0c;对于 Java 中的序列化&#xff0c;我一直停留在使用层面 —— 把需要序列化在网络上传输的类实现Serializable接口就可以了 但对于这块知识点&#xff0c;随着工作年限的提升&#xff0c;我觉得必须要好好研究下它了&#xff0c;不能似懂非懂的只知道使用。…...

Python面向对象编程(OOP)深度解析:从封装到继承的多维度实践

引言 面向对象编程(Object-Oriented Programming, OOP)是Python开发中的核心范式&#xff0c;其三大特性——​​封装、继承、多态​​——为构建模块化、可维护的代码提供了坚实基础。本文将通过代码实例与理论结合的方式&#xff0c;系统解析Python OOP的实现机制与高级特性…...

传输层:UDP协议

1.UDP协议特点 2.UDP报文格式 如下&#xff1a; 校验和的计算&#xff1a; 3.例子 UDP&#xff08;User Datagram Protocol&#xff0c;用户数据报协议&#xff09;是一种无连接的传输层协议&#xff0c;其报文格式简单高效&#xff0c;适用于对实时性要求高但允许少量丢包的…...